BKDWriter could make splitting decisions based on the actual range of values [LUCENE-8928]

mikemccand commented 5 years ago

Currently BKDWriter assumes that splitting on one dimension has no effect on values in other dimensions. While this may be ok for geo points, this is usually not true for ranges (or geo shapes, which are ranges too). Maybe we could get better indexing by re-computing the range of values on each dimension before making the choice of the split dimension?

Legacy Jira details

LUCENE-8928 by Adrien Grand (@jpountz) on Jul 19 2019, resolved Oct 16 2019

mikemccand commented 5 years ago

I played with this idea a bit at https://github.com/jpountz/lucene-solr/commit/16e6594af44b753c9ac498a063eb9b9d6102e020 and https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/IndexAndSearchOpenStreetMaps.java with shapes. It's a bit artificial since we are using shapes to index points, but nevertheless I got 62% slower indexing (130 seconds instead of 80) but 45% faster searching for box queries (63.0 QPS instead of 43.5).

[Legacy Jira: Adrien Grand (@jpountz) on Jul 19 2019]

mikemccand commented 5 years ago

I run this approach locally. It helps as well in the case of Geo3D (3 dimensions case) quite a bit. I tried different approaches to try to make indexation faster but so far no luck:

Approach	Index time (sec)	Index time (sec)		Force merge time (sec)	Force merge time (sec)		Index size (GB)	Index size (GB)		Reader heap (MB)	Reader heap (MB)
	Dev	Base	Diff	Dev	Base	diff	Dev	Base	Diff	Dev	Base	Diff
points	181.1s	124.4s	46%	76.9s	53.5s	44%	0.55	0.55	-0%	1.57	1.57	0%
shapes	327.4s	215.4s	52%	168.9s	120.2s	40%	1.28	1.29	-1%	1.62	1.61	0%
geo3d	211.9s	154.7s	37%	94.3s	66.4s	42%	0.75	0.75	-0%	1.58	1.58	0%

Approach	Shape	M hits/sec	M hits/sec		QPS	QPS		Hit count	Hit count

|| || ||Dev||Base ||Diff||Dev||Base||Diff||Dev||Base||Diff||

points	box	94.34	94.84	-1%	95.99	96.50	-1%	221118844	221118844	0%
points	polyRussia	20.07	20.46	-2%	5.72	5.83	-2%	3508846	3508846	0%
points	poly 10	88.64	87.56	1%	56.05	55.37	1%	355809475	355809475	0%
points	polyMedium	10.47	10.54	-1%	128.26	129.15	-1%	2693559	2693559	0%
points	distance	93.48	95.96	-3%	54.92	56.38	-3%	382961957	382961957	0%
points	nearest 10	0.10	0.09	11%	9687.24	8755.72	11%	60844404	60844404	0%
points	sort	43.12	43.04	0%	43.88	43.80	0%	221118844	221118844	0%
shapes	box	66.02	52.23	26%	67.18	53.15	26%	221118844	221118844	0%
shapes	polyRussia	11.57	9.85	17%	3.30	2.81	17%	3508846	3508846	0%
shapes	poly 10	54.98	47.08	17%	34.77	29.77	17%	355809475	355809475	0%
shapes	polyMedium	5.31	4.52	17%	65.01	55.39	17%	2693559	2693559	0%
geo3d	box	79.17	66.22	20%	80.56	67.38	20%	221118844	221118844	0%
geo3d	polyRussia	0.95	0.90	5%	0.27	0.26	5%	3508671	3508671	0%
geo3d	poly 10	77.26	57.16	35%	48.85	36.14	35%	355855227	355855227	0%
geo3d	polyMedium	0.95	0.69	37%	11.62	8.50	37%	2693545	2693545	0%
geo3d	distance	95.35	76.17	25%	55.96	44.70	25%	383371884	383371884	0%

[Legacy Jira: Ignacio Vera (@iverase) on Jul 22 2019 [updated: Sep 24 2019]]

mikemccand commented 5 years ago

Seems like it should only happen when number of dimensions is large?

[Legacy Jira: Robert Muir (@rmuir) on Jul 22 2019]

mikemccand commented 5 years ago

I would expect some increase on QPS for 2D when there is correlation between the dimensions, e.g range fields.

[Legacy Jira: Ignacio Vera (@iverase) on Jul 22 2019]

mikemccand commented 5 years ago

Do you mean 1-dimensional ranges (actually 2D) or 2-dimensional ranges (actually 4D)

[Legacy Jira: Robert Muir (@rmuir) on Jul 22 2019]

mikemccand commented 5 years ago

1-dimensional ranges (2D in total).

In fact the shapes benchmark above is a 2-dimensional rage (4D in total).

[Legacy Jira: Ignacio Vera (@iverase) on Jul 22 2019]

mikemccand commented 5 years ago

I guess I'm saying, I don't see any evidence of speedups for 2 dimensions above. For > 2 dimensions, the indexing cost is worth the increased search performance, but for 2 dimensions it isn't the right tradeoff.

[Legacy Jira: Robert Muir (@rmuir) on Jul 22 2019]

mikemccand commented 5 years ago

I tried to see the effect running on 1D ranges and it is the same as above. So +1 to apply this change only when numDims > 2 as it seems the right trade off.

[Legacy Jira: Ignacio Vera (@iverase) on Jul 26 2019]

mikemccand commented 5 years ago

I have played a bit more with this idea and I wondered if we need to compute exact bounds for every split. I modified @jpountz patch so instead of computing the bounds for every split, it computes every N splits. This is controlled by a static property called SPLITS_BEFORE_EXACT_BOUNDS.

The patch can be found here: https://github.com/iverase/lucene-solr/commit/e63f8c73a86c46ec406143fcd0cb31a8371dfe63

My test show that setting this value to 4 (compute exact bounds every 4 splits) reduces the indexing overhead to around 10% and keeps almost the same performance as the previous approach. Maybe we can find a better heuristic to set such value.

In addition, this patch does not apply for dimension <= 2 and the split algorithm is reverted to the original one.

[Legacy Jira: Ignacio Vera (@iverase) on Sep 24 2019]

mikemccand commented 5 years ago

Run some benchmarks by comparing this new approach with the previous approach shown a similar query performance but a much faster indexing rate:

Approach	Index time (sec)	Index time (sec)		Force merge time (sec)	Force merge time (sec)		Index size (GB)	Index size (GB)		Reader heap (MB)	Reader heap (MB)
	Dev	Base	Diff	Dev	Base	diff	Dev	Base	Diff	Dev	Base	Diff
geo3d	163.5s	218.4s	-25%	0.0s	0.0s	0%	0.71	0.71	-0%	1.75	1.75	-0%
shapes	227.8s	319.6s	-29%	0.0s	0.0s	0%	1.27	1.27	0%	1.78	1.78	0%

Approach	Shape	M hits/sec	M hits/sec		QPS	QPS		Hit count	Hit count

|| || ||Dev||Base ||Diff||Dev||Base||Diff||Dev||Base||Diff||

geo3d	box	55.58	57.53	-3%	56.56	58.54	-3%	221118844	221118844	0%
geo3d	polyRussia	0.56	0.56	-1%	0.16	0.16	-1%	3508671	3508671	0%
geo3d	poly 10	48.87	51.25	-5%	30.90	32.41	-5%	355855227	355855227	0%
geo3d	polyMedium	0.62	0.63	-1%	7.64	7.67	-1%	2693545	2693545	0%
geo3d	distance	68.16	69.70	-2%	40.00	40.91	-2%	383371884	383371884	0%
shapes	box	45.99	46.52	-1%	46.80	47.34	-1%	221118844	221118844	0%
shapes	polyRussia	6.64	7.01	-5%	1.89	2.00	-5%	3508846	3508846	0%
shapes	poly 10	33.40	34.69	-4%	21.12	21.93	-4%	355809475	355809475	0%
shapes	polyMedium	3.07	3.30	-7%	37.62	40.43	-7%	2693559	2693559	0%

[Legacy Jira: Ignacio Vera (@iverase) on Sep 26 2019]

mikemccand commented 5 years ago

I haven't looked at the code, but based on your description this looks interesting. Maybe open a PR?

[Legacy Jira: Adrien Grand (@jpountz) on Oct 04 2019]

mikemccand commented 5 years ago

Commit 0295e281d0498bdcd58ab023ad6dd9d8804a0815 in lucene-solr's branch refs/heads/master from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0295e28

LUCENE-8928: Compute exact bounds every N splits (#926)

When building a kd-tree for dimensions n > 2, compute exact bounds for an inner node every N splits to improve the quality of the tree. N is defined by SPLITS_BEFORE_EXACT_BOUNDS which is set to 4.

[Legacy Jira: ASF subversion and git services on Oct 11 2019]

mikemccand commented 5 years ago

Commit a9c77504023b3f1e0b81dbe52537fa19f4586200 in lucene-solr's branch refs/heads/branch_8x from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a9c7750

LUCENE-8928: Compute exact bounds every N splits (#926)

When building a kd-tree for dimensions n > 2, compute exact bounds for an inner node every N splits to improve the quality of the tree. N is defined by SPLITS_BEFORE_EXACT_BOUNDS which is set to 4.

[Legacy Jira: ASF subversion and git services on Oct 11 2019]

mikemccand commented 5 years ago

Commit 0295e281d0498bdcd58ab023ad6dd9d8804a0815 in lucene-solr's branch refs/heads/jira/SOLR-13731 from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0295e28

LUCENE-8928: Compute exact bounds every N splits (#926)

When building a kd-tree for dimensions n > 2, compute exact bounds for an inner node every N splits to improve the quality of the tree. N is defined by SPLITS_BEFORE_EXACT_BOUNDS which is set to 4.

[Legacy Jira: ASF subversion and git services on Oct 14 2019]

mikemccand commented 5 years ago

The nightly benchmarks liked this change http://people.apache.org/~mikemccand/geobench.html

32% faster distance filtering with Geo3D
38% faster 10-gons filtering with Geo3D
20% faster 10-gons filtering with shapes
30% faster box filtering with Geo3D
22% faster box filtering with shapes
1% space reduction for shapes

However

~7% slower indexing for Geo3D
~8% slower indexing for shapes

[Legacy Jira: Adrien Grand (@jpountz) on Oct 14 2019]

mikemccand commented 5 years ago

@daddywri The above might be interesting to you.

[Legacy Jira: Adrien Grand (@jpountz) on Oct 14 2019]

mikemccand commented 5 years ago

Wow :) Awesome!

[Legacy Jira: Michael McCandless (@mikemccand) on Oct 14 2019]

mikemccand commented 5 years ago

Commit 1d97e25a8a77c44d60ea5350b344cce1e48dcc71 in lucene-solr's branch refs/heads/master from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1d97e25

LUCENE-8928: Check that point is inside an edge bounding box when checking if the point belongs to the edge

[Legacy Jira: ASF subversion and git services on Oct 14 2019]

mikemccand commented 5 years ago

Commit e719ea5a42c2b4ec4bec2668056708e910f85e05 in lucene-solr's branch refs/heads/branch_8x from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e719ea5

LUCENE-8928: Check that point is inside an edge bounding box when checking if the point belongs to the edge

[Legacy Jira: ASF subversion and git services on Oct 14 2019]

mikemccand commented 5 years ago

I'm curious whether that also helps the 3D field you use to filter deals @mikemccand @msokolov.

[Legacy Jira: Adrien Grand (@jpountz) on Oct 15 2019]

mikemccand commented 5 years ago

I'm curious whether that also helps the 3D field you use to filter deals

Yes, it will be interesting to see! It might take us a little while, and there tend to be more "real" deals around certain times of year ...

[Legacy Jira: Michael Sokolov (@msokolov) on Oct 15 2019]

mikemccand commented 4 years ago

Closing after the 8.4.0 release.

[Legacy Jira: Adrien Grand (@jpountz) on Dec 29 2019]

mikemccand / stargazers-migration-test

BKDWriter could make splitting decisions based on the actual range of values [LUCENE-8928] #925

Legacy Jira details