mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration

Explore selective dimension indexing in BKDReader/Writer [LUCENE-8496] #495

Closed mikemccand closed 6 years ago

mikemccand commented 6 years ago

This issue explores adding a new feature to BKDReader/Writer that lets users select fewer dimensions for building the BKD index than the total number of dimensions specified for field encoding. This is useful for dimensional data that is needed to interpret the encoded field but is unnecessary (or inefficient) for building the index structure. One such example is LatLonShape encoding. The first 4 dimensions may be used to efficiently search/index the triangle using its precomputed bounding box as a 4D point, and the remaining dimensions can be used to encode the vertices of the tessellated triangle. This causes BKD to act much like an R-Tree for shape data, where search is distilled into a 4D point (instead of a more expensive 6D point) and the triangle is encoded using a portion of the remaining (non-indexed) dimensions. Fields that use the full data range for indexing are not impacted and behave as they normally would.
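To make the index/data split concrete, here is a minimal, self-contained sketch. It is not the BKDReader/BKDWriter code: the class `SelectiveDimsSketch`, its method names, and the use of plain `int[]` points are all assumptions made for illustration. Range matching looks only at the leading `indexDims` values, while the trailing values ride along as payload and are returned untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only; this is not the BKDReader/BKDWriter code. */
final class SelectiveDimsSketch {

    /** True if the first indexDims values of point lie within [min, max], inclusive. */
    static boolean matches(int[] point, int[] min, int[] max, int indexDims) {
        for (int d = 0; d < indexDims; d++) {
            if (point[d] < min[d] || point[d] > max[d]) {
                return false;
            }
        }
        return true; // values at positions >= indexDims are never inspected
    }

    /** Returns every matching point in full, data dimensions included. */
    static List<int[]> search(List<int[]> points, int[] min, int[] max, int indexDims) {
        List<int[]> hits = new ArrayList<>();
        for (int[] p : points) {
            if (matches(p, min, max, indexDims)) {
                hits.add(p);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Two hypothetical 3-value points: 2 indexed values plus 1 payload value.
        List<int[]> points = Arrays.asList(new int[] {1, 1, 900}, new int[] {5, 5, -900});
        List<int[]> hits = search(points, new int[] {0, 0}, new int[] {2, 2}, 2);
        System.out.println(hits.size() + " hit, payload=" + hits.get(0)[2]);
    }
}
```

The payload values (900 and -900 here) lie far outside the query range, yet they do not affect matching, which is the behavior the proposal describes for non-indexed dimensions.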


Legacy Jira details

LUCENE-8496 by Nick Knize (@nknize) on Sep 12 2018, resolved Oct 18 2018 Attachments: LatLonShape_SelectiveEncoding.patch, LUCENE-8496.patch (versions: 5) Pull requests: https://github.com/apache/lucene-solr/pull/451

mikemccand commented 6 years ago

Initial patch provided:

The lion's share of the changes are made to FieldType, BKDWriter, and BKDReader.

Benchmarking

To benchmark the changes I updated LatLonShape (not included in this patch) and ran benchmark tests both with and without selective indexing. The results are below:

6 dimension encoded LatLonShape w/o selective indexing


INDEX SIZE: 1.2795778876170516 GB
READER MB: 1.7928361892700195
BEST M hits/sec: 11.67378231920028
BEST QPS: 6.8635445274291715 for 225 queries, totHits=382688713

7 dimension LatLonShape encoding w/ 4 dimension selective indexing


INDEX SIZE: 2.1509012933820486 GB
READER MB: 1.8154268264770508
BEST M hits/sec: 17.018094815004627
BEST QPS: 10.005707519719927 for 225 queries, totHits=382688713

The gains are a little better than the differences between searching a 4D range vs a 6D range. The index size increased because of the move from 6 to 7 dimensions, and also because I switched to a slightly larger encoding size.
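For readers who want the trade-off as ratios, the figures reported above can be compared with a few lines of arithmetic. This is only a sketch: the class name `BenchmarkRatios` is hypothetical, and the inputs are copied from the two result sets in this comment.

```java
import java.util.Locale;

/** Illustration only: compares the two result sets reported in this comment. */
final class BenchmarkRatios {

    /** Ratio of the selective-indexing figure to the baseline figure. */
    static double ratio(double withSelective, double baseline) {
        if (baseline <= 0) {
            throw new IllegalArgumentException("baseline must be positive; got " + baseline);
        }
        return withSelective / baseline;
    }

    public static void main(String[] args) {
        double qps = ratio(10.005707519719927, 6.8635445274291715);
        double indexSize = ratio(2.1509012933820486, 1.2795778876170516);
        // Locale.ROOT keeps the decimal separator stable across environments.
        System.out.printf(Locale.ROOT, "QPS ratio: %.2f%n", qps);
        System.out.printf(Locale.ROOT, "Index size ratio: %.2f%n", indexSize);
    }
}
```

With these inputs the QPS ratio comes out at roughly 1.46 and the index size ratio at roughly 1.68.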

[Legacy Jira: Nick Knize (@nknize) on Sep 12 2018]

mikemccand commented 6 years ago

It is a pity that the patch is so large given that the change is actually simple. I like the idea and the patch looks very clean overall; I see you added validation for corner cases, such as rejecting dataDimensionCount>0 with indexDimensionCount==0. Out of curiosity, did your working copy already have LUCENE-7862 when you ran the benchmark? I have some minor comments on the patch; could you maybe set up a pull request or use Apache reviewboard to make it easier to comment on your changes and iterate?
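The corner case mentioned here can be pictured with a small validator. This is a hedged sketch, not the code from the patch: the class `DimensionConfigCheck` and its `validate` method are hypothetical, and only the last check (data dimensions present but zero index dimensions) is taken from this comment. The other checks are assumptions added so the example is complete.

```java
/** Illustration only; the actual checks in the patch may differ. */
final class DimensionConfigCheck {

    static void validate(int dataDimensionCount, int indexDimensionCount) {
        // Assumption: negative counts are never meaningful.
        if (dataDimensionCount < 0 || indexDimensionCount < 0) {
            throw new IllegalArgumentException("dimension counts must be >= 0");
        }
        // Assumption: the index cannot use more dimensions than are encoded.
        if (indexDimensionCount > dataDimensionCount) {
            throw new IllegalArgumentException(
                "indexDimensionCount (" + indexDimensionCount
                    + ") must be <= dataDimensionCount (" + dataDimensionCount + ")");
        }
        // From the review comment: data dimensions with nothing to index on is rejected.
        if (dataDimensionCount > 0 && indexDimensionCount == 0) {
            throw new IllegalArgumentException(
                "indexDimensionCount must be > 0 when dataDimensionCount is "
                    + dataDimensionCount);
        }
    }

    public static void main(String[] args) {
        validate(7, 4); // accepted: 7 encoded dimensions, 4 of them indexed
        validate(0, 0); // accepted: a field with no point dimensions at all
        try {
            validate(6, 0);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```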

[Legacy Jira: Adrien Grand (@jpountz) on Sep 14 2018]

mikemccand commented 6 years ago

It is a pity that the patch is so large

Yeah. Refactoring pointDimensionCount touched a lot of files so the patch is rather busy. I could change it to leave pointDimensionCount as is and just add a new indexDimensionCount?

Out of curiosity, did your working copy already have LUCENE-7862 when you ran the benchmark?

Yes. My benchmark numbers include the latest change to store min/max packed values. The only difference is using LatLonShape without and with the selective indexing approach.

...could you maybe set up a pull request or use Apache reviewboard

Sure thing! I went ahead and opened a PR here

[Legacy Jira: Nick Knize (@nknize) on Sep 14 2018]

mikemccand commented 6 years ago

I've attached an updated patch that is consistent with the latest PR updates. Additionally, I've attached a WIP patch that modifies LatLonShape encoding to take advantage of the selective indexing changes and boost QPS, per the benchmark results listed above. I'll ultimately create a separate issue, but wanted to provide it now for anyone who wants to review the application of this feature or benchmark its performance.

[Legacy Jira: Nick Knize (@nknize) on Sep 28 2018]

mikemccand commented 6 years ago

Attaching latest patch consistent with recent PR updates. I think this is about ready.

I ran two new benchmarks using 20M documents from the PlanetOSM corpus. The first is without selective indexing and uses the LatLonShape currently in master; the second is with selective indexing and uses a 7 dimension long encoding for LatLonShape. The first 4 (index) dimensions are the bounding box of the triangle (compressed to Integer size) and the remaining 3 (data) dimensions are the three vertices of the triangle. What's nice about this is that we can arrange the three triangle vertices so that the first two represent the edge of the shape. I think this will provide a path forward for implementing the CONTAINS query.
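A rough sketch of what such a 7 dimension layout could look like is shown below. It is not the WIP LatLonShape patch: the class `TriangleEncodingSketch`, the order of the bounding box values (minY, minX, maxY, maxX), and the scheme that packs each vertex's two 32-bit coordinates into one long are all assumptions chosen for illustration.

```java
/** Illustration only; the actual LatLonShape encoding may differ. */
final class TriangleEncodingSketch {

    /** Packs two 32-bit values into one long: y in the high bits, x in the low bits. */
    static long pack(int y, int x) {
        return (((long) y) << 32) | (x & 0xFFFFFFFFL);
    }

    static int unpackY(long packed) {
        return (int) (packed >> 32);
    }

    static int unpackX(long packed) {
        return (int) packed;
    }

    /** Dims 0-3 hold the bounding box (index dims); dims 4-6 hold the vertices (data dims). */
    static long[] encode(int ay, int ax, int by, int bx, int cy, int cx) {
        long[] dims = new long[7];
        dims[0] = Math.min(ay, Math.min(by, cy)); // minY
        dims[1] = Math.min(ax, Math.min(bx, cx)); // minX
        dims[2] = Math.max(ay, Math.max(by, cy)); // maxY
        dims[3] = Math.max(ax, Math.max(bx, cx)); // maxX
        dims[4] = pack(ay, ax);
        dims[5] = pack(by, bx);
        dims[6] = pack(cy, cx);
        return dims;
    }

    /** Box-intersection test that reads only the 4 index dimensions. */
    static boolean bboxIntersects(long[] dims, int qMinY, int qMinX, int qMaxY, int qMaxX) {
        return dims[0] <= qMaxY && dims[2] >= qMinY
            && dims[1] <= qMaxX && dims[3] >= qMinX;
    }

    public static void main(String[] args) {
        long[] dims = encode(10, 20, 30, 5, -7, 15);
        System.out.println("bbox hit: " + bboxIntersects(dims, 0, 0, 8, 8));
        System.out.println("vertex c: y=" + unpackY(dims[6]) + ", x=" + unpackX(dims[6]));
    }
}
```

A query can discard most triangles with the cheap 4D bounding box test and decode the vertices from the data dimensions only for the candidates that survive.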

Nevertheless, this is a great benchmark because it uses real-world shape data (LineStrings, MultiLineStrings, Polygons, and MultiPolygons, with and without holes). The geometries are quite complex and the results are looking good.

6 dimension LatLonShape w/o selective indexing:

INDEX SIZE: 4.944349942728877 GB
READER MB: 2.7018051147460938
maxDoc=20000000
BEST M hits/sec: 0.9616386761834046
BEST QPS: 5.816462716249435

7 dimension LatLonShape w/ selective indexing:

INDEX SIZE: 8.462444095872343 GB
READER MB: 2.5924673080444336
maxDoc=20000000
BEST M hits/sec: 1.5636273064182318
BEST QPS: 9.457585426978618

[Legacy Jira: Nick Knize (@nknize) on Oct 01 2018]

mikemccand commented 6 years ago

Updated patch that's consistent with the PR changes and feedback. Will commit this to master and 7.x pending a final QA check.

[Legacy Jira: Nick Knize (@nknize) on Oct 03 2018]

mikemccand commented 6 years ago
-1 overall
Vote Subsystem Runtime Comment
Prechecks
+1 test4tests 0m 0s The patch appears to include 10 new or modified test files.
master Compile Tests
+1 compile 5m 57s master passed
Patch Compile Tests
+1 compile 6m 31s the patch passed
+1 javac 6m 31s the patch passed
+1 Release audit (RAT) 1m 44s the patch passed
+1 Check forbidden APIs 0m 17s the patch passed
+1 Validate source patterns 0m 17s the patch passed
Other Tests
-1 unit 7m 11s codecs in the patch failed.
+1 unit 31m 26s core in the patch passed.
+1 unit 2m 16s highlighter in the patch passed.
+1 unit 1m 15s join in the patch passed.
+1 unit 0m 17s memory in the patch passed.
+1 unit 4m 56s sandbox in the patch passed.
+1 unit 2m 45s spatial-extras in the patch passed.
+1 unit 5m 53s test-framework in the patch passed.
-1 unit 87m 35s core in the patch failed.
163m 16s
Reason Tests
Failed junit tests lucene.codecs.simpletext.TestSimpleTextPointsFormat
solr.cloud.autoscaling.IndexSizeTriggerTest
solr.cloud.autoscaling.sim.TestSimTriggerIntegration
Subsystem Report/Notes
JIRA Issue LUCENE-8496
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12942299/LUCENE-8496.patch
Optional Tests compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool ant
Personality /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision master / 46f753d
ant version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018
Default Java 1.8.0_172
unit https://builds.apache.org/job/PreCommit-LUCENE-Build/99/artifact/out/patch-unit-lucene_codecs.txt
unit https://builds.apache.org/job/PreCommit-LUCENE-Build/99/artifact/out/patch-unit-solr_core.txt
Test Results https://builds.apache.org/job/PreCommit-LUCENE-Build/99/testReport/
modules C: lucene lucene/codecs lucene/core lucene/highlighter lucene/join lucene/memory lucene/sandbox lucene/spatial-extras lucene/test-framework solr/core U: .
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/99/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

[Legacy Jira: Lucene/Solr QA on Oct 04 2018]

mikemccand commented 6 years ago

Updated patch to make SimpleTextBKDWriter consistent w/ BKDWriter. Will commit pending QA

[Legacy Jira: Nick Knize (@nknize) on Oct 05 2018]

mikemccand commented 6 years ago
-1 overall
Vote Subsystem Runtime Comment
-1 patch 0m 6s LUCENE-8496 does not apply to master. Rebase required? Wrong Branch? See https://wiki.apache.org/lucene-java/HowToContribute#Contributing_your_work for help.
Subsystem Report/Notes
JIRA Issue LUCENE-8496
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12942614/LUCENE-8496.patch
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/102/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

[Legacy Jira: Lucene/Solr QA on Oct 06 2018]

mikemccand commented 6 years ago

Posted bad patch. Correct patch provided for QA

[Legacy Jira: Nick Knize (@nknize) on Oct 06 2018]

mikemccand commented 6 years ago
-1 overall
Vote Subsystem Runtime Comment
Prechecks
+1 test4tests 0m 0s The patch appears to include 10 new or modified test files.
master Compile Tests
+1 compile 7m 5s master passed
Patch Compile Tests
+1 compile 5m 39s the patch passed
+1 javac 5m 39s the patch passed
+1 Release audit (RAT) 0m 59s the patch passed
+1 Check forbidden APIs 0m 30s the patch passed
+1 Validate source patterns 0m 30s the patch passed
Other Tests
+1 unit 9m 5s codecs in the patch passed.
+1 unit 30m 31s core in the patch passed.
+1 unit 1m 18s highlighter in the patch passed.
+1 unit 1m 58s join in the patch passed.
+1 unit 1m 11s memory in the patch passed.
+1 unit 4m 35s sandbox in the patch passed.
+1 unit 1m 18s spatial-extras in the patch passed.
+1 unit 4m 44s test-framework in the patch passed.
-1 unit 90m 55s core in the patch failed.
165m 2s
Reason Tests
Failed junit tests solr.cloud.autoscaling.sim.TestSimPolicyCloud
Subsystem Report/Notes
JIRA Issue LUCENE-8496
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12942690/LUCENE-8496.patch
Optional Tests compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool ant
Personality /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision master / 367bdf7
ant version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018
Default Java 1.8.0_172
unit https://builds.apache.org/job/PreCommit-LUCENE-Build/103/artifact/out/patch-unit-solr_core.txt
Test Results https://builds.apache.org/job/PreCommit-LUCENE-Build/103/testReport/
modules C: lucene lucene/codecs lucene/core lucene/highlighter lucene/join lucene/memory lucene/sandbox lucene/spatial-extras lucene/test-framework solr/core U: .
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/103/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

[Legacy Jira: Lucene/Solr QA on Oct 07 2018]

mikemccand commented 6 years ago

Failure on branch_7x: ant test -Dtestcase=TestBKD -Dtests.seed=3A807E1398CE4499 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=sr-Latn-BA -Dtests.timezone=Africa/Malabo -Dtests.asserts=true -Dtests.file.encoding=US-ASCII

Muting test until fix is pushed.

[Legacy Jira: Nick Knize (@nknize) on Oct 10 2018]

mikemccand commented 6 years ago

I went ahead and reverted this feature from branch_7x until the backport can be cleaned up. Sorry for the noise.

[Legacy Jira: Nick Knize (@nknize) on Oct 10 2018]

mikemccand commented 6 years ago

FYI two other failing tests on branch_7x from https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/2891/ (before the commit was reverted):

ant test -Dtestcase=TestLucene60PointsFormat -Dtests.seed=B5A28E6677965A99 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=fr-CA -Dtests.timezone=Asia/Irkutsk -Dtests.asserts=true -Dtests.file.encoding=UTF-8
ant test -Dtestcase=TestAssertingPointsFormat -Dtests.seed=F280908F18AE1657 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=dz -Dtests.timezone=Etc/GMT-10 -Dtests.asserts=true -Dtests.file.encoding=US-ASCII

[Legacy Jira: Steven Rowe on Oct 11 2018]

mikemccand commented 6 years ago

Commit 804afbfd47cc8d86ceda6ea66f0afe304af1ad1b in lucene-solr's branch refs/heads/branch_7x from @nknize https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=804afbf

LUCENE-8496: Selective indexing - modify BKDReader/BKDWriter to allow users to select a fewer number of dimensions to be used for creating the index than the total number of dimensions used for field encoding. i.e., dimensions 0 to N may be used to determine how to split the inner nodes, and dimensions N+1 to D are ignored and stored as data dimensions at the leaves.
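As a rough picture of what "only the index dimensions determine how inner nodes are split" can mean, the sketch below picks a split dimension from the leading index dimensions alone. It is an assumption-laden illustration, not the BKDWriter logic: the class `SplitDimSketch` is hypothetical, and choosing the dimension with the widest value range is a common k-d tree heuristic that may differ from what BKDWriter actually does.

```java
/** Illustration only; BKDWriter's real split heuristics may differ. */
final class SplitDimSketch {

    /**
     * Returns the dimension in [0, indexDims) with the widest value range.
     * Dimensions at positions >= indexDims (data dimensions) are never considered.
     * Assumes value ranges small enough that max - min does not overflow a long.
     */
    static int chooseSplitDim(long[][] points, int indexDims) {
        if (points.length == 0 || indexDims <= 0) {
            throw new IllegalArgumentException("need at least one point and one index dimension");
        }
        int best = 0;
        long bestRange = -1;
        for (int d = 0; d < indexDims; d++) {
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (long[] p : points) {
                min = Math.min(min, p[d]);
                max = Math.max(max, p[d]);
            }
            if (max - min > bestRange) {
                bestRange = max - min;
                best = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical points: 2 index dimensions followed by 1 data dimension.
        long[][] points = {{1, 5, 1000}, {3, 50, -1000}, {2, 20, 0}};
        System.out.println("split dim with 2 index dims: " + chooseSplitDim(points, 2));
    }
}
```

In this example the third value has by far the widest range, but because it is treated as a data dimension it never influences the split. It would simply be stored with the point at the leaf.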

[Legacy Jira: ASF subversion and git services on Oct 18 2018]

mikemccand commented 6 years ago

Closing; pushed to master and branch_7x

[Legacy Jira: Nick Knize (@nknize) on Oct 18 2018]