Explore selective dimension indexing in BKDReader/Writer [LUCENE-8496]

mikemccand commented 6 years ago

This issue explores adding a new feature to BKDReader/Writer that enables users to select a fewer number of dimensions to be used for creating the BKD index than the total number of dimensions specified for field encoding. This is useful for encoding dimensional data that is used for interpreting the encoded field data but unnecessary (or not efficient) for creating the index structure. One such example is LatLonShape encoding. The first 4 dimensions may be used to to efficiently search/index the triangle using its precomputed bounding box as a 4D point, and the remaining dimensions can be used to encode the vertices of the tessellated triangle. This causes BKD to act much like an R-Tree for shape data where search is distilled into a 4D point (instead of a more expensive 6D point) and the triangle is encoded using a portion of the remaining (non-indexed) dimensions. Fields that use the full data range for indexing are not impacted and behave as they normally would.

Legacy Jira details

LUCENE-8496 by Nick Knize (@nknize) on Sep 12 2018, resolved Oct 18 2018 Attachments: LatLonShape_SelectiveEncoding.patch, LUCENE-8496.patch (versions: 5) Pull requests: https://github.com/apache/lucene-solr/pull/451

mikemccand commented 6 years ago

Initial patch provided:

The lionshare of the changes are made to FieldType, BKDWriter, and BKDReader.

FieldType - split pointDimensionCount into two new integers that define pointDataDimensionCount and pointIndexDimensionCount. pointIndexDimensionCount must be <= pointDataDimensionCount and defines the first n dimensions that will be used to build the index. The remaining pointDataDimensionCount - pointIndexDimensionCount dimensions are ignored while building (e.g., split/merge) the index. Getter and Setter utility methods are added.
BKDWriter - change writeIndex to encode and write numIndexDims in the 2 most significant bytes of the integer that formerly stored numDims this provides simple backwards compatability without requiring a change to FieldInfoFormat. Indexing methods are updated to only use the first numIndexDims while building the tree. Leaf nodes still use numDataDims for efficiently packing and compressing the leaf level data (data nodes).
BKDReader - add version checking in the constructor to decode numIndexDims and numDataDims from the packed dimension integer. Update index reading methods to only look at the first numIndexDims while traversing the tree. numDataDims are still used for decoding leaf level data.
API Changes - all instances of pointDimensionCount have been updated to pointDataDimensionCount and pointIndexDimensionCount to reflect total number of dimensions, and number of dimensions used for creating the index, respectively.
All POINT Tests and POINT based Fields have been updated to use the API changes.

Benchmarking —

To benchmark the changes I update LatLonShape (not included in this patch) and ran benchmark tests both with and without selective indexing. The results are below:

6 dimension encoded LatLonShape w/o selective indexing

INDEX SIZE: 1.2795778876170516 GB READER MB: 1.7928361892700195 BEST M hits/sec: 11.67378231920028 BEST QPS: 6.8635445274291715 for 225 queries, totHits=382688713

7 dimension LatLonShape encoding w/ 4 dimension selective indexing

INDEX SIZE: 2.1509012933820486 GB READER MB: 1.8154268264770508 BEST M hits/sec: 17.018094815004627 BEST QPS: 10.005707519719927 for 225 queries, totHits=382688713

The gains are a little better than the differences between searching a 4d range vs a 6d range. The index size increased due to using 7 dimensions instead of 6, but I also switched over to a bit bigger encoding size.

[Legacy Jira: Nick Knize (@nknize) on Sep 12 2018]

mikemccand commented 6 years ago

It is a pity that the patch is so large given that the change is actually simple. I like the idea and the patch looks very clean overall, I see you added validation for corner-cases like rejecting dataDimensionCount>0 but indexDimensionCount==0. Out of curiosity, did your working copy already have LUCENE-7862 when you ran the benchmark? I have some minor comments on the patch, could you maybe set up a pull request or use Apache reviewboard to make it easier to comment on your changes and iterate?

[Legacy Jira: Adrien Grand (@jpountz) on Sep 14 2018]

mikemccand commented 6 years ago

It is a pity that the patch is so large

Yeah. Refactoring pointDimensionCount touched a lot of files so the patch is rather busy. I could change it to leave pointDimensionCount as is and just add a new indexDimensionCount?

Out of curiosity, did your working copy already have LUCENE-7862 when you ran the benchmark?

Yes. My benchmark numbers include the latest change to store min/max packed values. The only difference is using LatLonShape without and with the selective indexing approach.

...could you maybe set up a pull request or use Apache reviewboard

Sure thing! I went ahead and opened a PR here

[Legacy Jira: Nick Knize (@nknize) on Sep 14 2018]

mikemccand commented 6 years ago

I've attached an updated patch that is consistent with the latest PR updates. Additionally, I've attached a WIP patch that modifies LatLonShape encoding to take advantage of the selective indexing changes to boost QPS per the table listed above. I'll ultimately create a separate issue but wanted to provide it now for anyone that wants it to review the application of this feature or benchmark performance.

[Legacy Jira: Nick Knize (@nknize) on Sep 28 2018]

mikemccand commented 6 years ago

Attaching latest patch consistent with recent PR updates. I think this is about ready.

I ran two new benchmarks using 20M documents from the PlanetOSM corpus of data. The first is without selective indexing and using LatLonShape currently in master, the second is with selective indexing and using 7 dimension long encoding for LatLonShape. The first 4 index dimensions are the bounding box of the triangle (compressed to Integer size) and the remaining 3 data dimensions are the three vertices of the triangle. What's nice about this is we can arrange the three triangle vertices so that the first two represent the edge of the shape. I think this will provide a path forward for implementing the CONTAINS query.

Nevertheless, this is a great benchmark because it uses real world shape data. (LineStrings, MultiLineStrings, Polygons, and MultiPolygons - with and without holes). The geometries are quite complex and the results are looking good.

6 dimension LatLonShape w/o selective indexing: — INDEX SIZE: 4.944349942728877 GB READER MB: 2.7018051147460938 maxDoc=20000000 BEST M hits/sec: 0.9616386761834046 BEST QPS: 5.816462716249435

7 dimension LatLonShape w/ selective indexing: — INDEX SIZE: 8.462444095872343 GB READER MB: 2.5924673080444336 maxDoc=20000000 BEST M hits/sec: 1.5636273064182318 BEST QPS: 9.457585426978618

[Legacy Jira: Nick Knize (@nknize) on Oct 01 2018]

mikemccand commented 6 years ago

Updated patch that's consistent with the PR changes and feedback. Will commit this to master and 7.x pending a final QA check.

[Legacy Jira: Nick Knize (@nknize) on Oct 03 2018]

mikemccand commented 6 years ago

❌ -1 overall

Vote	Subsystem	Runtime	Comment
Prechecks
+1	test4tests	0m 0s	The patch appears to include 10 new or modified test files.
master Compile Tests
+1	compile	5m 57s	master passed
Patch Compile Tests
+1	compile	6m 31s	the patch passed
+1	javac	6m 31s	the patch passed
+1	Release audit (RAT)	1m 44s	the patch passed
+1	Check forbidden APIs	0m 17s	the patch passed
+1	Validate source patterns	0m 17s	the patch passed
Other Tests
-1	unit	7m 11s	codecs in the patch failed.
+1	unit	31m 26s	core in the patch passed.
+1	unit	2m 16s	highlighter in the patch passed.
+1	unit	1m 15s	join in the patch passed.
+1	unit	0m 17s	memory in the patch passed.
+1	unit	4m 56s	sandbox in the patch passed.
+1	unit	2m 45s	spatial-extras in the patch passed.
+1	unit	5m 53s	test-framework in the patch passed.
-1	unit	87m 35s	core in the patch failed.
	163m 16s

Reason	Tests
Failed junit tests	lucene.codecs.simpletext.TestSimpleTextPointsFormat
solr.cloud.autoscaling.IndexSizeTriggerTest
solr.cloud.autoscaling.sim.TestSimTriggerIntegration

Subsystem	Report/Notes
JIRA Issue	LUCENE-8496
JIRA Patch URL	https://issues.apache.org/jira/secure/attachment/12942299/LUCENE-8496.patch
Optional Tests	compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname	Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool	ant
Personality	/home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision	master / 46f753d
ant	version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018
Default Java	1.8.0_172
unit	https://builds.apache.org/job/PreCommit-LUCENE-Build/99/artifact/out/patch-unit-lucene_codecs.txt
unit	https://builds.apache.org/job/PreCommit-LUCENE-Build/99/artifact/out/patch-unit-solr_core.txt
Test Results	https://builds.apache.org/job/PreCommit-LUCENE-Build/99/testReport/
modules	C: lucene lucene/codecs lucene/core lucene/highlighter lucene/join lucene/memory lucene/sandbox lucene/spatial-extras lucene/test-framework solr/core U: .
Console output	https://builds.apache.org/job/PreCommit-LUCENE-Build/99/console
Powered by	Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

[Legacy Jira: Lucene/Solr QA on Oct 04 2018]

mikemccand commented 6 years ago

Updated patch to make SimpleTextBKDWriter consistent w/ BKDWriter. Will commit pending QA

[Legacy Jira: Nick Knize (@nknize) on Oct 05 2018]

mikemccand commented 6 years ago

❌ -1 overall

Vote	Subsystem	Runtime	Comment
-1	patch	0m 6s	LUCENE-8496 does not apply to master. Rebase required? Wrong Branch? See https://wiki.apache.org/lucene-java/HowToContribute#Contributing_your_work for help.

Subsystem	Report/Notes
JIRA Issue	LUCENE-8496
JIRA Patch URL	https://issues.apache.org/jira/secure/attachment/12942614/LUCENE-8496.patch
Console output	https://builds.apache.org/job/PreCommit-LUCENE-Build/102/console
Powered by	Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

[Legacy Jira: Lucene/Solr QA on Oct 06 2018]

mikemccand commented 6 years ago

Posted bad patch. Correct patch provided for QA

[Legacy Jira: Nick Knize (@nknize) on Oct 06 2018]

mikemccand commented 6 years ago

❌ -1 overall

Vote	Subsystem	Runtime	Comment
Prechecks
+1	test4tests	0m 0s	The patch appears to include 10 new or modified test files.
master Compile Tests
+1	compile	7m 5s	master passed
Patch Compile Tests
+1	compile	5m 39s	the patch passed
+1	javac	5m 39s	the patch passed
+1	Release audit (RAT)	0m 59s	the patch passed
+1	Check forbidden APIs	0m 30s	the patch passed
+1	Validate source patterns	0m 30s	the patch passed
Other Tests
+1	unit	9m 5s	codecs in the patch passed.
+1	unit	30m 31s	core in the patch passed.
+1	unit	1m 18s	highlighter in the patch passed.
+1	unit	1m 58s	join in the patch passed.
+1	unit	1m 11s	memory in the patch passed.
+1	unit	4m 35s	sandbox in the patch passed.
+1	unit	1m 18s	spatial-extras in the patch passed.
+1	unit	4m 44s	test-framework in the patch passed.
-1	unit	90m 55s	core in the patch failed.
	165m 2s

Reason	Tests
Failed junit tests	solr.cloud.autoscaling.sim.TestSimPolicyCloud

Subsystem	Report/Notes
JIRA Issue	LUCENE-8496
JIRA Patch URL	https://issues.apache.org/jira/secure/attachment/12942690/LUCENE-8496.patch
Optional Tests	compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname	Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool	ant
Personality	/home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision	master / 367bdf7
ant	version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018
Default Java	1.8.0_172
unit	https://builds.apache.org/job/PreCommit-LUCENE-Build/103/artifact/out/patch-unit-solr_core.txt
Test Results	https://builds.apache.org/job/PreCommit-LUCENE-Build/103/testReport/
modules	C: lucene lucene/codecs lucene/core lucene/highlighter lucene/join lucene/memory lucene/sandbox lucene/spatial-extras lucene/test-framework solr/core U: .
Console output	https://builds.apache.org/job/PreCommit-LUCENE-Build/103/console
Powered by	Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

[Legacy Jira: Lucene/Solr QA on Oct 07 2018]

mikemccand commented 6 years ago

Failure on branch_7x: ant test -Dtestcase=TestBKD -Dtests.seed=3A807E1398CE4499 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=sr-Latn-BA -Dtests.timezone=Africa/Malabo -Dtests.asserts=true -Dtests.file.encoding=US-ASCII

Muting test until fix is pushed.

[Legacy Jira: Nick Knize (@nknize) on Oct 10 2018]

mikemccand commented 6 years ago

I went ahead and reverted this feature from branch_7x until the backport can be cleaned up. Sorry for the noise.

[Legacy Jira: Nick Knize (@nknize) on Oct 10 2018]

mikemccand commented 6 years ago

FYI two other failing tests on branch_7x from https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/2891/ (before the commit was reverted):

ant test -Dtestcase=TestLucene60PointsFormat -Dtests.seed=B5A28E6677965A99 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=fr-CA -Dtests.timezone=Asia/Irkutsk -Dtests.asserts=true -Dtests.file.encoding=UTF-8

ant test -Dtestcase=TestAssertingPointsFormat -Dtests.seed=F280908F18AE1657 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=dz -Dtests.timezone=Etc/GMT-10 -Dtests.asserts=true -Dtests.file.encoding=US-ASCII

[Legacy Jira: Steven Rowe on Oct 11 2018]

mikemccand commented 6 years ago

Commit 804afbfd47cc8d86ceda6ea66f0afe304af1ad1b in lucene-solr's branch refs/heads/branch_7x from @nknize https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=804afbf

LUCENE-8496: Selective indexing - modify BKDReader/BKDWriter to allow users to select a fewer number of dimensions to be used for creating the index than the total number of dimensions used for field encoding. i.e., dimensions 0 to N may be used to determine how to split the inner nodes, and dimensions N+1 to D are ignored and stored as data dimensions at the leaves.

[Legacy Jira: ASF subversion and git services on Oct 18 2018]

mikemccand commented 6 years ago

Closing; pushed to master and branch_7x

[Legacy Jira: Nick Knize (@nknize) on Oct 18 2018]

mikemccand / stargazers-migration-test

Explore selective dimension indexing in BKDReader/Writer [LUCENE-8496] #495

Legacy Jira details