mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration
0 stars 0 forks source link

BKDWriter could make splitting decisions based on the actual range of values [LUCENE-8928] #925

Closed mikemccand closed 5 years ago

mikemccand commented 5 years ago

Currently BKDWriter assumes that splitting on one dimension has no effect on values in other dimensions. While this may be ok for geo points, this is usually not true for ranges (or geo shapes, which are ranges too). Maybe we could get better indexing by re-computing the range of values on each dimension before making the choice of the split dimension?


Legacy Jira details

LUCENE-8928 by Adrien Grand (@jpountz) on Jul 19 2019, resolved Oct 16 2019

mikemccand commented 5 years ago

I played with this idea a bit at https://github.com/jpountz/lucene-solr/commit/16e6594af44b753c9ac498a063eb9b9d6102e020 and https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/IndexAndSearchOpenStreetMaps.java with shapes. It's a bit artificial since we are using shapes to index points, but nevertheless I got 62% slower indexing (130 seconds instead of 80) but 45% faster searching for box queries (63.0 QPS instead of 43.5).

[Legacy Jira: Adrien Grand (@jpountz) on Jul 19 2019]

mikemccand commented 5 years ago

I run this approach locally. It helps as well in the case of Geo3D (3 dimensions case) quite a bit. I tried different approaches to try to make indexation faster but so far no luck:

 

Approach Index time (sec) Index time (sec) Force merge time (sec) Force merge time (sec) Index size (GB) Index size (GB) Reader heap (MB) Reader heap (MB)
  Dev Base Diff Dev Base diff Dev Base Diff Dev Base Diff
points 181.1s 124.4s 46% 76.9s 53.5s 44% 0.55 0.55 -0% 1.57 1.57 0%
shapes 327.4s 215.4s 52% 168.9s 120.2s 40% 1.28 1.29 -1% 1.62 1.61 0%
geo3d 211.9s 154.7s 37% 94.3s 66.4s 42% 0.75 0.75 -0% 1.58 1.58 0%

 

Approach Shape M hits/sec M hits/sec QPS QPS Hit count Hit count

|| || ||Dev||Base ||Diff||Dev||Base||Diff||Dev||Base||Diff||

points box 94.34 94.84 -1% 95.99 96.50 -1% 221118844 221118844 0%
points polyRussia 20.07 20.46 -2% 5.72 5.83 -2% 3508846 3508846 0%
points poly 10 88.64 87.56 1% 56.05 55.37 1% 355809475 355809475 0%
points polyMedium 10.47 10.54 -1% 128.26 129.15 -1% 2693559 2693559 0%
points distance 93.48 95.96 -3% 54.92 56.38 -3% 382961957 382961957 0%
points nearest 10 0.10 0.09 11% 9687.24 8755.72 11% 60844404 60844404 0%
points sort 43.12 43.04 0% 43.88 43.80 0% 221118844 221118844 0%
shapes box 66.02 52.23 26% 67.18 53.15 26% 221118844 221118844 0%
shapes polyRussia 11.57 9.85 17% 3.30 2.81 17% 3508846 3508846 0%
shapes poly 10 54.98 47.08 17% 34.77 29.77 17% 355809475 355809475 0%
shapes polyMedium 5.31 4.52 17% 65.01 55.39 17% 2693559 2693559 0%
geo3d box 79.17 66.22 20% 80.56 67.38 20% 221118844 221118844 0%
geo3d polyRussia 0.95 0.90 5% 0.27 0.26 5% 3508671 3508671 0%
geo3d poly 10 77.26 57.16 35% 48.85 36.14 35% 355855227 355855227 0%
geo3d polyMedium 0.95 0.69 37% 11.62 8.50 37% 2693545 2693545 0%
geo3d distance 95.35 76.17 25% 55.96 44.70 25% 383371884 383371884 0%

 

[Legacy Jira: Ignacio Vera (@iverase) on Jul 22 2019 [updated: Sep 24 2019]]

mikemccand commented 5 years ago

Seems like it should only happen when number of dimensions is large?

[Legacy Jira: Robert Muir (@rmuir) on Jul 22 2019]

mikemccand commented 5 years ago

I would expect some increase on QPS for 2D when there is correlation between the dimensions, e.g range fields.

[Legacy Jira: Ignacio Vera (@iverase) on Jul 22 2019]

mikemccand commented 5 years ago

Do you mean 1-dimensional ranges (actually 2D) or 2-dimensional ranges (actually 4D)

[Legacy Jira: Robert Muir (@rmuir) on Jul 22 2019]

mikemccand commented 5 years ago

1-dimensional ranges (2D in total). 

In fact the shapes benchmark above is a 2-dimensional rage (4D in total).

 

[Legacy Jira: Ignacio Vera (@iverase) on Jul 22 2019]

mikemccand commented 5 years ago

I guess I'm saying, I don't see any evidence of speedups for 2 dimensions above. For > 2 dimensions, the indexing cost is worth the increased search performance, but for 2 dimensions it isn't the right tradeoff.

[Legacy Jira: Robert Muir (@rmuir) on Jul 22 2019]

mikemccand commented 5 years ago

I tried to see the effect running on 1D ranges and it is the same as above. So +1 to apply this change only when numDims > 2 as it seems the right trade off.

[Legacy Jira: Ignacio Vera (@iverase) on Jul 26 2019]

mikemccand commented 5 years ago

I have played a bit more with this idea and I wondered if we need to compute exact bounds for every split. I modified @jpountz  patch so instead of computing the bounds for every split, it computes every N splits. This is controlled by a static property called SPLITS_BEFORE_EXACT_BOUNDS.

The patch can be found here: https://github.com/iverase/lucene-solr/commit/e63f8c73a86c46ec406143fcd0cb31a8371dfe63

My test show that setting this value to 4 (compute exact bounds every 4 splits) reduces the indexing overhead to around 10% and keeps almost the same performance as the previous approach. Maybe we can find a better heuristic to set such value.

In addition, this patch does not apply for dimension <= 2 and the split algorithm is reverted to the original one.

 

[Legacy Jira: Ignacio Vera (@iverase) on Sep 24 2019]

mikemccand commented 5 years ago

Run some benchmarks by comparing this new approach with the previous approach shown a similar query performance but a much faster indexing rate:

Approach Index time (sec) Index time (sec) Force merge time (sec) Force merge time (sec) Index size (GB) Index size (GB) Reader heap (MB) Reader heap (MB)
  Dev Base Diff Dev Base diff Dev Base Diff Dev Base Diff
geo3d 163.5s 218.4s -25% 0.0s 0.0s 0% 0.71 0.71 -0% 1.75 1.75 -0%
shapes 227.8s 319.6s -29% 0.0s 0.0s 0% 1.27 1.27 0% 1.78 1.78 0%
Approach Shape M hits/sec M hits/sec QPS QPS Hit count Hit count

|| || ||Dev||Base ||Diff||Dev||Base||Diff||Dev||Base||Diff||

geo3d box 55.58 57.53 -3% 56.56 58.54 -3% 221118844 221118844 0%
geo3d polyRussia 0.56 0.56 -1% 0.16 0.16 -1% 3508671 3508671 0%
geo3d poly 10 48.87 51.25 -5% 30.90 32.41 -5% 355855227 355855227 0%
geo3d polyMedium 0.62 0.63 -1% 7.64 7.67 -1% 2693545 2693545 0%
geo3d distance 68.16 69.70 -2% 40.00 40.91 -2% 383371884 383371884 0%
shapes box 45.99 46.52 -1% 46.80 47.34 -1% 221118844 221118844 0%
shapes polyRussia 6.64 7.01 -5% 1.89 2.00 -5% 3508846 3508846 0%
shapes poly 10 33.40 34.69 -4% 21.12 21.93 -4% 355809475 355809475 0%
shapes polyMedium 3.07 3.30 -7% 37.62 40.43 -7% 2693559 2693559 0%

[Legacy Jira: Ignacio Vera (@iverase) on Sep 26 2019]

mikemccand commented 5 years ago

I haven't looked at the code, but based on your description this looks interesting. Maybe open a PR?

[Legacy Jira: Adrien Grand (@jpountz) on Oct 04 2019]

mikemccand commented 5 years ago

Commit 0295e281d0498bdcd58ab023ad6dd9d8804a0815 in lucene-solr's branch refs/heads/master from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0295e28

LUCENE-8928: Compute exact bounds every N splits (#926)

When building a kd-tree for dimensions n > 2, compute exact bounds for an inner node every N splits to improve the quality of the tree. N is defined by SPLITS_BEFORE_EXACT_BOUNDS which is set to 4.

[Legacy Jira: ASF subversion and git services on Oct 11 2019]

mikemccand commented 5 years ago

Commit a9c77504023b3f1e0b81dbe52537fa19f4586200 in lucene-solr's branch refs/heads/branch_8x from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a9c7750

LUCENE-8928: Compute exact bounds every N splits (#926)

When building a kd-tree for dimensions n > 2, compute exact bounds for an inner node every N splits to improve the quality of the tree. N is defined by SPLITS_BEFORE_EXACT_BOUNDS which is set to 4.

[Legacy Jira: ASF subversion and git services on Oct 11 2019]

mikemccand commented 5 years ago

Commit 0295e281d0498bdcd58ab023ad6dd9d8804a0815 in lucene-solr's branch refs/heads/jira/SOLR-13731 from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0295e28

LUCENE-8928: Compute exact bounds every N splits (#926)

When building a kd-tree for dimensions n > 2, compute exact bounds for an inner node every N splits to improve the quality of the tree. N is defined by SPLITS_BEFORE_EXACT_BOUNDS which is set to 4.

[Legacy Jira: ASF subversion and git services on Oct 14 2019]

mikemccand commented 5 years ago

The nightly benchmarks liked this change http://people.apache.org/~mikemccand/geobench.html

However

[Legacy Jira: Adrien Grand (@jpountz) on Oct 14 2019]

mikemccand commented 5 years ago

@daddywri The above might be interesting to you.

[Legacy Jira: Adrien Grand (@jpountz) on Oct 14 2019]

mikemccand commented 5 years ago

Wow :)  Awesome!

[Legacy Jira: Michael McCandless (@mikemccand) on Oct 14 2019]

mikemccand commented 5 years ago

Commit 1d97e25a8a77c44d60ea5350b344cce1e48dcc71 in lucene-solr's branch refs/heads/master from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1d97e25

LUCENE-8928: Check that point is inside an edge bounding box when checking if the point belongs to the edge

[Legacy Jira: ASF subversion and git services on Oct 14 2019]

mikemccand commented 5 years ago

Commit e719ea5a42c2b4ec4bec2668056708e910f85e05 in lucene-solr's branch refs/heads/branch_8x from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e719ea5

LUCENE-8928: Check that point is inside an edge bounding box when checking if the point belongs to the edge

[Legacy Jira: ASF subversion and git services on Oct 14 2019]

mikemccand commented 5 years ago

I'm curious whether that also helps the 3D field you use to filter deals @mikemccand @msokolov.

[Legacy Jira: Adrien Grand (@jpountz) on Oct 15 2019]

mikemccand commented 5 years ago

I'm curious whether that also helps the 3D field you use to filter deals

Yes, it will be interesting to see! It might take us a little while, and there tend to be more "real" deals around certain times of year ...

[Legacy Jira: Michael Sokolov (@msokolov) on Oct 15 2019]

mikemccand commented 4 years ago

Closing after the 8.4.0 release.

[Legacy Jira: Adrien Grand (@jpountz) on Dec 29 2019]