Closed mikemccand closed 5 years ago
I played with this idea a bit at https://github.com/jpountz/lucene-solr/commit/16e6594af44b753c9ac498a063eb9b9d6102e020 and https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/IndexAndSearchOpenStreetMaps.java with shapes. It's a bit artificial since we are using shapes to index points, but nevertheless I got 62% slower indexing (130 seconds instead of 80) but 45% faster searching for box queries (63.0 QPS instead of 43.5).
[Legacy Jira: Adrien Grand (@jpountz) on Jul 19 2019]
I run this approach locally. It helps as well in the case of Geo3D (3 dimensions case) quite a bit. I tried different approaches to try to make indexation faster but so far no luck:
Approach | Index time (sec) | Index time (sec) | Force merge time (sec) | Force merge time (sec) | Index size (GB) | Index size (GB) | Reader heap (MB) | Reader heap (MB) | ||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Dev | Base | Diff | Dev | Base | diff | Dev | Base | Diff | Dev | Base | Diff | |
points | 181.1s | 124.4s | 46% | 76.9s | 53.5s | 44% | 0.55 | 0.55 | -0% | 1.57 | 1.57 | 0% |
shapes | 327.4s | 215.4s | 52% | 168.9s | 120.2s | 40% | 1.28 | 1.29 | -1% | 1.62 | 1.61 | 0% |
geo3d | 211.9s | 154.7s | 37% | 94.3s | 66.4s | 42% | 0.75 | 0.75 | -0% | 1.58 | 1.58 | 0% |
Approach | Shape | M hits/sec | M hits/sec | QPS | QPS | Hit count | Hit count |
---|
|| || ||Dev||Base ||Diff||Dev||Base||Diff||Dev||Base||Diff||
points | box | 94.34 | 94.84 | -1% | 95.99 | 96.50 | -1% | 221118844 | 221118844 | 0% |
---|---|---|---|---|---|---|---|---|---|---|
points | polyRussia | 20.07 | 20.46 | -2% | 5.72 | 5.83 | -2% | 3508846 | 3508846 | 0% |
points | poly 10 | 88.64 | 87.56 | 1% | 56.05 | 55.37 | 1% | 355809475 | 355809475 | 0% |
points | polyMedium | 10.47 | 10.54 | -1% | 128.26 | 129.15 | -1% | 2693559 | 2693559 | 0% |
points | distance | 93.48 | 95.96 | -3% | 54.92 | 56.38 | -3% | 382961957 | 382961957 | 0% |
points | nearest 10 | 0.10 | 0.09 | 11% | 9687.24 | 8755.72 | 11% | 60844404 | 60844404 | 0% |
points | sort | 43.12 | 43.04 | 0% | 43.88 | 43.80 | 0% | 221118844 | 221118844 | 0% |
shapes | box | 66.02 | 52.23 | 26% | 67.18 | 53.15 | 26% | 221118844 | 221118844 | 0% |
shapes | polyRussia | 11.57 | 9.85 | 17% | 3.30 | 2.81 | 17% | 3508846 | 3508846 | 0% |
shapes | poly 10 | 54.98 | 47.08 | 17% | 34.77 | 29.77 | 17% | 355809475 | 355809475 | 0% |
shapes | polyMedium | 5.31 | 4.52 | 17% | 65.01 | 55.39 | 17% | 2693559 | 2693559 | 0% |
geo3d | box | 79.17 | 66.22 | 20% | 80.56 | 67.38 | 20% | 221118844 | 221118844 | 0% |
geo3d | polyRussia | 0.95 | 0.90 | 5% | 0.27 | 0.26 | 5% | 3508671 | 3508671 | 0% |
geo3d | poly 10 | 77.26 | 57.16 | 35% | 48.85 | 36.14 | 35% | 355855227 | 355855227 | 0% |
geo3d | polyMedium | 0.95 | 0.69 | 37% | 11.62 | 8.50 | 37% | 2693545 | 2693545 | 0% |
geo3d | distance | 95.35 | 76.17 | 25% | 55.96 | 44.70 | 25% | 383371884 | 383371884 | 0% |
[Legacy Jira: Ignacio Vera (@iverase) on Jul 22 2019 [updated: Sep 24 2019]]
Seems like it should only happen when number of dimensions is large?
[Legacy Jira: Robert Muir (@rmuir) on Jul 22 2019]
I would expect some increase on QPS for 2D when there is correlation between the dimensions, e.g range fields.
[Legacy Jira: Ignacio Vera (@iverase) on Jul 22 2019]
Do you mean 1-dimensional ranges (actually 2D) or 2-dimensional ranges (actually 4D)
[Legacy Jira: Robert Muir (@rmuir) on Jul 22 2019]
1-dimensional ranges (2D in total).
In fact the shapes benchmark above is a 2-dimensional rage (4D in total).
[Legacy Jira: Ignacio Vera (@iverase) on Jul 22 2019]
I guess I'm saying, I don't see any evidence of speedups for 2 dimensions above. For > 2 dimensions, the indexing cost is worth the increased search performance, but for 2 dimensions it isn't the right tradeoff.
[Legacy Jira: Robert Muir (@rmuir) on Jul 22 2019]
I tried to see the effect running on 1D ranges and it is the same as above. So +1 to apply this change only when numDims > 2 as it seems the right trade off.
[Legacy Jira: Ignacio Vera (@iverase) on Jul 26 2019]
I have played a bit more with this idea and I wondered if we need to compute exact bounds for every split. I modified @jpountz patch so instead of computing the bounds for every split, it computes every N splits. This is controlled by a static property called SPLITS_BEFORE_EXACT_BOUNDS
.
The patch can be found here: https://github.com/iverase/lucene-solr/commit/e63f8c73a86c46ec406143fcd0cb31a8371dfe63
My test show that setting this value to 4 (compute exact bounds every 4 splits) reduces the indexing overhead to around 10% and keeps almost the same performance as the previous approach. Maybe we can find a better heuristic to set such value.
In addition, this patch does not apply for dimension <= 2 and the split algorithm is reverted to the original one.
[Legacy Jira: Ignacio Vera (@iverase) on Sep 24 2019]
Run some benchmarks by comparing this new approach with the previous approach shown a similar query performance but a much faster indexing rate:
Approach | Index time (sec) | Index time (sec) | Force merge time (sec) | Force merge time (sec) | Index size (GB) | Index size (GB) | Reader heap (MB) | Reader heap (MB) | ||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Dev | Base | Diff | Dev | Base | diff | Dev | Base | Diff | Dev | Base | Diff | |
geo3d | 163.5s | 218.4s | -25% | 0.0s | 0.0s | 0% | 0.71 | 0.71 | -0% | 1.75 | 1.75 | -0% |
shapes | 227.8s | 319.6s | -29% | 0.0s | 0.0s | 0% | 1.27 | 1.27 | 0% | 1.78 | 1.78 | 0% |
Approach | Shape | M hits/sec | M hits/sec | QPS | QPS | Hit count | Hit count |
---|
|| || ||Dev||Base ||Diff||Dev||Base||Diff||Dev||Base||Diff||
geo3d | box | 55.58 | 57.53 | -3% | 56.56 | 58.54 | -3% | 221118844 | 221118844 | 0% |
---|---|---|---|---|---|---|---|---|---|---|
geo3d | polyRussia | 0.56 | 0.56 | -1% | 0.16 | 0.16 | -1% | 3508671 | 3508671 | 0% |
geo3d | poly 10 | 48.87 | 51.25 | -5% | 30.90 | 32.41 | -5% | 355855227 | 355855227 | 0% |
geo3d | polyMedium | 0.62 | 0.63 | -1% | 7.64 | 7.67 | -1% | 2693545 | 2693545 | 0% |
geo3d | distance | 68.16 | 69.70 | -2% | 40.00 | 40.91 | -2% | 383371884 | 383371884 | 0% |
shapes | box | 45.99 | 46.52 | -1% | 46.80 | 47.34 | -1% | 221118844 | 221118844 | 0% |
shapes | polyRussia | 6.64 | 7.01 | -5% | 1.89 | 2.00 | -5% | 3508846 | 3508846 | 0% |
shapes | poly 10 | 33.40 | 34.69 | -4% | 21.12 | 21.93 | -4% | 355809475 | 355809475 | 0% |
shapes | polyMedium | 3.07 | 3.30 | -7% | 37.62 | 40.43 | -7% | 2693559 | 2693559 | 0% |
[Legacy Jira: Ignacio Vera (@iverase) on Sep 26 2019]
I haven't looked at the code, but based on your description this looks interesting. Maybe open a PR?
[Legacy Jira: Adrien Grand (@jpountz) on Oct 04 2019]
Commit 0295e281d0498bdcd58ab023ad6dd9d8804a0815 in lucene-solr's branch refs/heads/master from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0295e28
LUCENE-8928: Compute exact bounds every N splits (#926)
When building a kd-tree for dimensions n > 2, compute exact bounds for an inner node every N splits to improve the quality of the tree. N is defined by SPLITS_BEFORE_EXACT_BOUNDS which is set to 4.
[Legacy Jira: ASF subversion and git services on Oct 11 2019]
Commit a9c77504023b3f1e0b81dbe52537fa19f4586200 in lucene-solr's branch refs/heads/branch_8x from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a9c7750
LUCENE-8928: Compute exact bounds every N splits (#926)
When building a kd-tree for dimensions n > 2, compute exact bounds for an inner node every N splits to improve the quality of the tree. N is defined by SPLITS_BEFORE_EXACT_BOUNDS which is set to 4.
[Legacy Jira: ASF subversion and git services on Oct 11 2019]
Commit 0295e281d0498bdcd58ab023ad6dd9d8804a0815 in lucene-solr's branch refs/heads/jira/SOLR-13731 from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0295e28
LUCENE-8928: Compute exact bounds every N splits (#926)
When building a kd-tree for dimensions n > 2, compute exact bounds for an inner node every N splits to improve the quality of the tree. N is defined by SPLITS_BEFORE_EXACT_BOUNDS which is set to 4.
[Legacy Jira: ASF subversion and git services on Oct 14 2019]
The nightly benchmarks liked this change http://people.apache.org/~mikemccand/geobench.html
However
[Legacy Jira: Adrien Grand (@jpountz) on Oct 14 2019]
@daddywri The above might be interesting to you.
[Legacy Jira: Adrien Grand (@jpountz) on Oct 14 2019]
Wow :) Awesome!
[Legacy Jira: Michael McCandless (@mikemccand) on Oct 14 2019]
Commit 1d97e25a8a77c44d60ea5350b344cce1e48dcc71 in lucene-solr's branch refs/heads/master from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1d97e25
LUCENE-8928: Check that point is inside an edge bounding box when checking if the point belongs to the edge
[Legacy Jira: ASF subversion and git services on Oct 14 2019]
Commit e719ea5a42c2b4ec4bec2668056708e910f85e05 in lucene-solr's branch refs/heads/branch_8x from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e719ea5
LUCENE-8928: Check that point is inside an edge bounding box when checking if the point belongs to the edge
[Legacy Jira: ASF subversion and git services on Oct 14 2019]
I'm curious whether that also helps the 3D field you use to filter deals @mikemccand @msokolov.
[Legacy Jira: Adrien Grand (@jpountz) on Oct 15 2019]
I'm curious whether that also helps the 3D field you use to filter deals
Yes, it will be interesting to see! It might take us a little while, and there tend to be more "real" deals around certain times of year ...
[Legacy Jira: Michael Sokolov (@msokolov) on Oct 15 2019]
Closing after the 8.4.0 release.
[Legacy Jira: Adrien Grand (@jpountz) on Dec 29 2019]
Currently BKDWriter assumes that splitting on one dimension has no effect on values in other dimensions. While this may be ok for geo points, this is usually not true for ranges (or geo shapes, which are ranges too). Maybe we could get better indexing by re-computing the range of values on each dimension before making the choice of the split dimension?
Legacy Jira details
LUCENE-8928 by Adrien Grand (@jpountz) on Jul 19 2019, resolved Oct 16 2019