Initial patch provided:
The lion's share of the changes are made to `FieldType`, `BKDWriter`, and `BKDReader`.
`FieldType` - split `pointDimensionCount` into two new integers that define `pointDataDimensionCount` and `pointIndexDimensionCount`. `pointIndexDimensionCount` must be <= `pointDataDimensionCount` and defines the first `n` dimensions that will be used to build the index. The remaining `pointDataDimensionCount` - `pointIndexDimensionCount` dimensions are ignored while building (e.g., split/merge) the index. Getter and setter utility methods are added.
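The constraint between the two counts can be sketched as follows. This is only an illustration: `PointDims` and its method names are hypothetical stand-ins, not Lucene's `FieldType`, and the check that rejects a zero index count for a non-zero data count is taken from the review comment later in this thread.

```java
// Hypothetical stand-in for the two dimension counts described above.
// This is NOT Lucene's FieldType; it only illustrates the constraint.
final class PointDims {
  private int pointDataDimensionCount;
  private int pointIndexDimensionCount;

  /** Sets both counts, enforcing 0 <= indexDims <= dataDims. */
  void setDimensions(int dataDims, int indexDims) {
    if (dataDims < 0) {
      throw new IllegalArgumentException("dataDims must be >= 0; got " + dataDims);
    }
    if (indexDims < 0 || indexDims > dataDims) {
      throw new IllegalArgumentException(
          "indexDims must be >= 0 and <= dataDims (" + dataDims + "); got " + indexDims);
    }
    if (dataDims > 0 && indexDims == 0) {
      throw new IllegalArgumentException("indexDims must be > 0 when dataDims > 0");
    }
    this.pointDataDimensionCount = dataDims;
    this.pointIndexDimensionCount = indexDims;
  }

  int pointDataDimensionCount() {
    return pointDataDimensionCount;
  }

  int pointIndexDimensionCount() {
    return pointIndexDimensionCount;
  }

  /** Trailing dimensions that are stored but ignored while building the index. */
  int dataOnlyDimensionCount() {
    return pointDataDimensionCount - pointIndexDimensionCount;
  }
}
```

With 7 data dimensions and 4 index dimensions, `dataOnlyDimensionCount()` returns 3, matching the `LatLonShape` layout discussed in the benchmarks.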
`BKDWriter` - change `writeIndex` to encode and write `numIndexDims` in the 2 most significant bytes of the integer that formerly stored `numDims`; this provides simple backwards compatibility without requiring a change to `FieldInfoFormat`. Indexing methods are updated to only use the first `numIndexDims` dimensions while building the tree. Leaf nodes still use `numDataDims` for efficiently packing and compressing the leaf level data (data nodes).
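To make "only use the first `numIndexDims` while building the tree" concrete, here is a minimal sketch of choosing a split dimension. The widest-span heuristic, the `SplitDim` class, and the use of plain `int` arrays are assumptions for illustration; the actual `BKDWriter` works on packed `byte[]` values and may choose its split dimension differently.

```java
// Illustrative only: picks a split dimension while ignoring data-only dimensions.
final class SplitDim {
  /**
   * Returns the dimension with the widest [min, max] span, considering only the
   * first numIndexDims entries. Entries at or beyond numIndexDims are never read.
   */
  static int chooseSplitDim(int[] minValues, int[] maxValues, int numIndexDims) {
    if (minValues.length != maxValues.length) {
      throw new IllegalArgumentException("min/max arrays must have the same length");
    }
    if (numIndexDims < 1 || numIndexDims > minValues.length) {
      throw new IllegalArgumentException("numIndexDims out of range: " + numIndexDims);
    }
    int bestDim = 0;
    // Widen to long so the subtraction cannot overflow.
    long bestSpan = (long) maxValues[0] - minValues[0];
    for (int dim = 1; dim < numIndexDims; dim++) {
      long span = (long) maxValues[dim] - minValues[dim];
      if (span > bestSpan) {
        bestSpan = span;
        bestDim = dim;
      }
    }
    return bestDim;
  }
}
```

In this sketch, a data-only dimension with a very wide range never influences the split, which is the behavior the patch description asks for.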
`BKDReader` - add version checking in the constructor to decode `numIndexDims` and `numDataDims` from the packed dimension integer. Update index reading methods to only look at the first `numIndexDims` dimensions while traversing the tree. `numDataDims` is still used for decoding leaf level data.
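The packing and version-gated decoding described for `BKDWriter` and `BKDReader` could look roughly like this. The 16-bit shift follows the "2 most significant bytes" wording above, but the class name `PackedDims`, the constant `VERSION_PACKED_DIMS`, and its value are hypothetical; the on-disk format that was eventually committed may differ.

```java
// A sketch of packing two dimension counts into one int, as described in the patch notes.
final class PackedDims {
  // Hypothetical version number: indices written at or after it use the packed layout.
  static final int VERSION_PACKED_DIMS = 6;

  /** Stores numIndexDims in the upper 16 bits and numDataDims in the lower 16 bits. */
  static int pack(int numDataDims, int numIndexDims) {
    if (numDataDims < 0 || numDataDims > 0xFFFF) {
      throw new IllegalArgumentException("numDataDims out of range: " + numDataDims);
    }
    if (numIndexDims < 0 || numIndexDims > numDataDims) {
      throw new IllegalArgumentException("numIndexDims out of range: " + numIndexDims);
    }
    return (numIndexDims << 16) | numDataDims;
  }

  /** Older indices stored a single numDims value, which is returned unchanged. */
  static int numDataDims(int stored, int version) {
    return version >= VERSION_PACKED_DIMS ? stored & 0xFFFF : stored;
  }

  /** For older indices every dimension is an index dimension. */
  static int numIndexDims(int stored, int version) {
    return version >= VERSION_PACKED_DIMS ? stored >>> 16 : stored;
  }
}
```

Because an old index has zeros in the upper bytes, a reader needs the version check to tell "no index dimensions recorded" apart from "all dimensions are index dimensions".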
API Changes - all instances of `pointDimensionCount` have been updated to `pointDataDimensionCount` and `pointIndexDimensionCount` to reflect the total number of dimensions and the number of dimensions used for creating the index, respectively.
All POINT Tests and POINT based Fields have been updated to use the API changes.
Benchmarking:
To benchmark the changes I updated `LatLonShape` (not included in this patch) and ran benchmark tests both with and without selective indexing. The results are below:
6 dimension encoded `LatLonShape` w/o selective indexing
INDEX SIZE: 1.2795778876170516 GB READER MB: 1.7928361892700195 BEST M hits/sec: 11.67378231920028 BEST QPS: 6.8635445274291715 for 225 queries, totHits=382688713
7 dimension LatLonShape encoding w/ 4 dimension selective indexing
INDEX SIZE: 2.1509012933820486 GB READER MB: 1.8154268264770508 BEST M hits/sec: 17.018094815004627 BEST QPS: 10.005707519719927 for 225 queries, totHits=382688713
The gains are a little better than the difference between searching a 4d range vs. a 6d range. The index size increased because of using 7 dimensions instead of 6, and because I also switched over to a somewhat larger encoding size.
[Legacy Jira: Nick Knize (@nknize) on Sep 12 2018]
It is a pity that the patch is so large given that the change is actually simple. I like the idea and the patch looks very clean overall; I see you added validation for corner cases like rejecting dataDimensionCount>0 but indexDimensionCount==0. Out of curiosity, did your working copy already have LUCENE-7862 when you ran the benchmark? I have some minor comments on the patch. Could you maybe set up a pull request or use Apache reviewboard to make it easier to comment on your changes and iterate?
[Legacy Jira: Adrien Grand (@jpountz) on Sep 14 2018]
It is a pity that the patch is so large
Yeah. Refactoring `pointDimensionCount` touched a lot of files so the patch is rather busy. I could change it to leave `pointDimensionCount` as is and just add a new `indexDimensionCount`?
Out of curiosity, did your working copy already have LUCENE-7862 when you ran the benchmark?
Yes. My benchmark numbers include the latest change to store min/max packed values. The only difference is using `LatLonShape` without and with the selective indexing approach.
...could you maybe set up a pull request or use Apache reviewboard
Sure thing! I went ahead and opened a PR here
[Legacy Jira: Nick Knize (@nknize) on Sep 14 2018]
I've attached an updated patch that is consistent with the latest PR updates. Additionally, I've attached a WIP patch that modifies `LatLonShape` encoding to take advantage of the selective indexing changes to boost QPS per the results listed above. I'll ultimately create a separate issue but wanted to provide it now for anyone who wants to review the application of this feature or benchmark its performance.
[Legacy Jira: Nick Knize (@nknize) on Sep 28 2018]
Attaching latest patch consistent with recent PR updates. I think this is about ready.
I ran two new benchmarks using 20M documents from the PlanetOSM corpus of data. The first is without selective indexing and uses the `LatLonShape` currently in master; the second is with selective indexing and uses a 7 dimension long encoding for `LatLonShape`. The first 4 index dimensions are the bounding box of the triangle (compressed to Integer size) and the remaining 3 data dimensions are the three vertices of the triangle. What's nice about this is we can arrange the three triangle vertices so that the first two represent the edge of the shape. I think this will provide a path forward for implementing the `CONTAINS` query.
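A rough sketch of what such a 7 dimension layout could look like is shown here. Everything in it is an assumption made for illustration: the `TriangleEncoding` class, the order of the bounding-box dimensions, and the choice to pack each vertex's already-quantized latitude and longitude into one long. The WIP patch attached to this issue may lay out the bytes differently.

```java
// Illustrative 7-dimension layout: 4 index dimensions (bounding box) + 3 data dimensions (vertices).
final class TriangleEncoding {
  /** Packs one vertex into a long: latitude in the high 32 bits, longitude in the low 32 bits. */
  static long packVertex(int lat, int lon) {
    return ((long) lat << 32) | (lon & 0xFFFFFFFFL);
  }

  static int vertexLat(long packed) {
    return (int) (packed >> 32);
  }

  static int vertexLon(long packed) {
    return (int) packed;
  }

  /** Inputs are assumed to be latitude/longitude values already quantized to ints. */
  static long[] encode(int aLat, int aLon, int bLat, int bLon, int cLat, int cLon) {
    long[] dims = new long[7];
    // Index dimensions 0..3: bounding box of the triangle (assumed order).
    dims[0] = Math.min(aLat, Math.min(bLat, cLat)); // minLat
    dims[1] = Math.min(aLon, Math.min(bLon, cLon)); // minLon
    dims[2] = Math.max(aLat, Math.max(bLat, cLat)); // maxLat
    dims[3] = Math.max(aLon, Math.max(bLon, cLon)); // maxLon
    // Data dimensions 4..6: the three vertices, ignored while building the tree.
    dims[4] = packVertex(aLat, aLon);
    dims[5] = packVertex(bLat, bLon);
    dims[6] = packVertex(cLat, cLon);
    return dims;
  }
}
```

Under this layout a query only needs the first four values to prune inner nodes, and it decodes the vertices only for leaf entries whose bounding box matches.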
Nevertheless, this is a great benchmark because it uses real world shape data (LineStrings, MultiLineStrings, Polygons, and MultiPolygons, with and without holes). The geometries are quite complex and the results are looking good.
6 dimension `LatLonShape` w/o selective indexing:
INDEX SIZE: 4.944349942728877 GB
READER MB: 2.7018051147460938
maxDoc=20000000
BEST M hits/sec: 0.9616386761834046
BEST QPS: 5.816462716249435
7 dimension `LatLonShape` w/ selective indexing:
INDEX SIZE: 8.462444095872343 GB
READER MB: 2.5924673080444336
maxDoc=20000000
BEST M hits/sec: 1.5636273064182318
BEST QPS: 9.457585426978618
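For a quick read of the trade-off, the ratios implied by the figures in this thread can be computed directly. The `BenchmarkRatios` class is a throwaway helper written for this note; the numbers are copied from the benchmark output above and from the earlier benchmark comment.

```java
import java.util.Locale;

// Throwaway helper: relative change between the "with" and "without" selective indexing runs.
final class BenchmarkRatios {
  static double ratio(double withSelective, double withoutSelective) {
    return withSelective / withoutSelective;
  }

  public static void main(String[] args) {
    // Earlier benchmark (225 queries).
    double qps1 = ratio(10.005707519719927, 6.8635445274291715);
    double size1 = ratio(2.1509012933820486, 1.2795778876170516);
    // PlanetOSM benchmark (20M documents).
    double qps2 = ratio(9.457585426978618, 5.816462716249435);
    double size2 = ratio(8.462444095872343, 4.944349942728877);
    System.out.println(String.format(Locale.ROOT,
        "earlier run: QPS x%.2f, index size x%.2f", qps1, size1));
    System.out.println(String.format(Locale.ROOT,
        "PlanetOSM:   QPS x%.2f, index size x%.2f", qps2, size2));
  }
}
```

In both runs QPS improves by roughly 1.5x to 1.6x while the index grows by about 1.7x, so the speedup is paid for with disk space.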
[Legacy Jira: Nick Knize (@nknize) on Oct 01 2018]
Updated patch that's consistent with the PR changes and feedback. Will commit this to master and 7.x pending a final QA check.
[Legacy Jira: Nick Knize (@nknize) on Oct 03 2018]
❌ -1 overall |
---|
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
Prechecks | |||
+1 | test4tests | 0m 0s | The patch appears to include 10 new or modified test files. |
master Compile Tests | |||
+1 | compile | 5m 57s | master passed |
Patch Compile Tests | |||
+1 | compile | 6m 31s | the patch passed |
+1 | javac | 6m 31s | the patch passed |
+1 | Release audit (RAT) | 1m 44s | the patch passed |
+1 | Check forbidden APIs | 0m 17s | the patch passed |
+1 | Validate source patterns | 0m 17s | the patch passed |
Other Tests | |||
-1 | unit | 7m 11s | codecs in the patch failed. |
+1 | unit | 31m 26s | core in the patch passed. |
+1 | unit | 2m 16s | highlighter in the patch passed. |
+1 | unit | 1m 15s | join in the patch passed. |
+1 | unit | 0m 17s | memory in the patch passed. |
+1 | unit | 4m 56s | sandbox in the patch passed. |
+1 | unit | 2m 45s | spatial-extras in the patch passed. |
+1 | unit | 5m 53s | test-framework in the patch passed. |
-1 | unit | 87m 35s | core in the patch failed. |
| | | 163m 16s | |
Reason | Tests |
---|---|
Failed junit tests | lucene.codecs.simpletext.TestSimpleTextPointsFormat |
| | solr.cloud.autoscaling.IndexSizeTriggerTest |
| | solr.cloud.autoscaling.sim.TestSimTriggerIntegration |
Subsystem | Report/Notes |
---|---|
JIRA Issue | LUCENE-8496 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12942299/LUCENE-8496.patch |
Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | ant |
Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
git revision | master / 46f753d |
ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
Default Java | 1.8.0_172 |
unit | https://builds.apache.org/job/PreCommit-LUCENE-Build/99/artifact/out/patch-unit-lucene_codecs.txt |
unit | https://builds.apache.org/job/PreCommit-LUCENE-Build/99/artifact/out/patch-unit-solr_core.txt |
Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/99/testReport/ |
modules | C: lucene lucene/codecs lucene/core lucene/highlighter lucene/join lucene/memory lucene/sandbox lucene/spatial-extras lucene/test-framework solr/core U: . |
Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/99/console |
Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |
This message was automatically generated.
[Legacy Jira: Lucene/Solr QA on Oct 04 2018]
Updated patch to make `SimpleTextBKDWriter` consistent w/ `BKDWriter`. Will commit pending QA.
[Legacy Jira: Nick Knize (@nknize) on Oct 05 2018]
❌ -1 overall |
---|
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
-1 | patch | 0m 6s | LUCENE-8496 does not apply to master. Rebase required? Wrong Branch? See https://wiki.apache.org/lucene-java/HowToContribute#Contributing_your_work for help. |
Subsystem | Report/Notes |
---|---|
JIRA Issue | LUCENE-8496 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12942614/LUCENE-8496.patch |
Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/102/console |
Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |
This message was automatically generated.
[Legacy Jira: Lucene/Solr QA on Oct 06 2018]
Posted bad patch. Correct patch provided for QA
[Legacy Jira: Nick Knize (@nknize) on Oct 06 2018]
❌ -1 overall |
---|
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
Prechecks | |||
+1 | test4tests | 0m 0s | The patch appears to include 10 new or modified test files. |
master Compile Tests | |||
+1 | compile | 7m 5s | master passed |
Patch Compile Tests | |||
+1 | compile | 5m 39s | the patch passed |
+1 | javac | 5m 39s | the patch passed |
+1 | Release audit (RAT) | 0m 59s | the patch passed |
+1 | Check forbidden APIs | 0m 30s | the patch passed |
+1 | Validate source patterns | 0m 30s | the patch passed |
Other Tests | |||
+1 | unit | 9m 5s | codecs in the patch passed. |
+1 | unit | 30m 31s | core in the patch passed. |
+1 | unit | 1m 18s | highlighter in the patch passed. |
+1 | unit | 1m 58s | join in the patch passed. |
+1 | unit | 1m 11s | memory in the patch passed. |
+1 | unit | 4m 35s | sandbox in the patch passed. |
+1 | unit | 1m 18s | spatial-extras in the patch passed. |
+1 | unit | 4m 44s | test-framework in the patch passed. |
-1 | unit | 90m 55s | core in the patch failed. |
| | | 165m 2s | |
Reason | Tests |
---|---|
Failed junit tests | solr.cloud.autoscaling.sim.TestSimPolicyCloud |
Subsystem | Report/Notes |
---|---|
JIRA Issue | LUCENE-8496 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12942690/LUCENE-8496.patch |
Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | ant |
Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
git revision | master / 367bdf7 |
ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
Default Java | 1.8.0_172 |
unit | https://builds.apache.org/job/PreCommit-LUCENE-Build/103/artifact/out/patch-unit-solr_core.txt |
Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/103/testReport/ |
modules | C: lucene lucene/codecs lucene/core lucene/highlighter lucene/join lucene/memory lucene/sandbox lucene/spatial-extras lucene/test-framework solr/core U: . |
Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/103/console |
Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |
This message was automatically generated.
[Legacy Jira: Lucene/Solr QA on Oct 07 2018]
Failure on branch_7x:
ant test -Dtestcase=TestBKD -Dtests.seed=3A807E1398CE4499 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=sr-Latn-BA -Dtests.timezone=Africa/Malabo -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
Muting test until fix is pushed.
[Legacy Jira: Nick Knize (@nknize) on Oct 10 2018]
I went ahead and reverted this feature from branch_7x until the backport can be cleaned up. Sorry for the noise.
[Legacy Jira: Nick Knize (@nknize) on Oct 10 2018]
FYI two other failing tests on branch_7x from https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/2891/ (before the commit was reverted):
ant test -Dtestcase=TestLucene60PointsFormat -Dtests.seed=B5A28E6677965A99 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=fr-CA -Dtests.timezone=Asia/Irkutsk -Dtests.asserts=true -Dtests.file.encoding=UTF-8
ant test -Dtestcase=TestAssertingPointsFormat -Dtests.seed=F280908F18AE1657 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=dz -Dtests.timezone=Etc/GMT-10 -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
[Legacy Jira: Steven Rowe on Oct 11 2018]
Commit 804afbfd47cc8d86ceda6ea66f0afe304af1ad1b in lucene-solr's branch refs/heads/branch_7x from @nknize https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=804afbf
LUCENE-8496: Selective indexing - modify BKDReader/BKDWriter to allow users to select a fewer number of dimensions to be used for creating the index than the total number of dimensions used for field encoding. i.e., dimensions 0 to N may be used to determine how to split the inner nodes, and dimensions N+1 to D are ignored and stored as data dimensions at the leaves.
[Legacy Jira: ASF subversion and git services on Oct 18 2018]
Closing; pushed to master and branch_7x
[Legacy Jira: Nick Knize (@nknize) on Oct 18 2018]
This issue explores adding a new feature to BKDReader/Writer that enables users to select fewer dimensions for creating the BKD index than the total number of dimensions specified for field encoding. This is useful for encoding dimensional data that is used for interpreting the encoded field data but unnecessary (or not efficient) for creating the index structure. One such example is `LatLonShape` encoding. The first 4 dimensions may be used to efficiently search/index the triangle using its precomputed bounding box as a 4D point, and the remaining dimensions can be used to encode the vertices of the tessellated triangle. This causes BKD to act much like an R-Tree for shape data where search is distilled into a 4D point (instead of a more expensive 6D point) and the triangle is encoded using a portion of the remaining (non-indexed) dimensions. Fields that use the full data range for indexing are not impacted and behave as they normally would.
Legacy Jira details
LUCENE-8496 by Nick Knize (@nknize) on Sep 12 2018, resolved Oct 18 2018
Attachments: LatLonShape_SelectiveEncoding.patch, LUCENE-8496.patch (versions: 5)
Pull requests: https://github.com/apache/lucene-solr/pull/451