Closed mikemccand closed 5 years ago
I've attached a WIP patch for initial feedback.
Patch summary:
LatLonShape
abstract class that uses factory methods to index and search.createIndexableFields
static method creates an array of TriangleField
objectsTriangleField
array calling document.add
which will add each tessellated triangle to the indexLatLonPoint
the LatLonShape.newBoxQuery
method will create a new LatLonShapeBoundingBoxQuery
TestLatLonShapeQueries
class that is very similar to BaseGeoPointTestCase
in that it will test 3 different sized random data sets (Tiny, Medium, Big)geo.TestTessellator
and document.TestLatLonShape
that contain some explicit testing for each of those classesJavadocs are included. There are many todos, such as generalizing the bounding box query into a generic shape query that will query indexed shapes with provided query shapes. I think this can be done rather easily by adding a relateTriangle
utility method to Polygon2D
for computing relations of indexed triangles with the target shape. That, however, will be left as a future feature.
[Legacy Jira: Nick Knize (@nknize) on Jul 12 2018]
❌ -1 overall |
---|
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
Prechecks | |||
+1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
master Compile Tests | |||
+1 | compile | 0m 57s | master passed |
Patch Compile Tests | |||
-1 | compile | 0m 8s | sandbox in the patch failed. |
-1 | javac | 0m 7s | sandbox in the patch failed. |
+1 | Release audit (RAT) | 0m 32s | Release audit (RAT) rat-sources passed |
-1 | Release audit (RAT) | 0m 4s | sandbox in the patch failed. |
+1 | Release audit (RAT) | 0m 7s | Release audit (RAT) rat-sources passed |
-1 | Check forbidden APIs | 0m 32s | Check forbidden APIs check-forbidden-apis failed |
-1 | Validate source patterns | 0m 32s | Check forbidden APIs check-forbidden-apis failed |
Other Tests | |||
+1 | unit | 21m 12s | core in the patch passed. |
-1 | unit | 0m 7s | sandbox in the patch failed. |
+1 | unit | 3m 26s | test-framework in the patch passed. |
27m 51s |
This message was automatically generated.
[Legacy Jira: Lucene/Solr QA on Jul 13 2018]
This looks awesome. Indexing two additional dimensions sounds worth it to me if it helps index way fewer fields. It's also exciting we're exercising high numbers of dimensions with the points API. Since this is targeting sandbox and the patch is a bit large, I think it's easier to get it in soon and iterate from there? It looks pretty clean to me already. Let's maybe just make windingOrder final on the Polygon class first.
Some other comments I had while skimming through the patch:
[Legacy Jira: Adrien Grand (@jpountz) on Jul 13 2018]
do quantization first [...] and then make tessellation work directly in the quantized space
Now that I think about it, it's probably easier to do it the other way around, as the polygon could easily become invalid through quantization? Having triangles hold doubles rather than quantized ints might also help if we want to further split these triangles in the future, eg. to prevent flat triangles from increasing the area of the intersection of MBRs of sibling nodes in the tree.
[Legacy Jira: Adrien Grand (@jpountz) on Jul 13 2018]
This looks really cool! It is great to see many dimensions being used with points. +1 to push to sandbox and iterate there.
[Legacy Jira: Michael McCandless (@mikemccand) on Jul 13 2018]
+1 super cool Nick! I like how the use of a tessellation technique can represent polygons with a number of triangles on the order of the number of edge vertices, and you wind up with perfect accuracy & scalability. There are also some off the shelf computational geometry libraries that can simplify polygons (JTS can), and some users may want to do that when indexing shapes.
In your example you show a quad grid index of today decomposing a shape into a million terms. Why would someone do this? If someone wants to index shapes today with spatial-extras, they ought to use SerializedDVStrategy for accuracy combined with RecursivePrefixTreeStrategy for a grid index (perhaps 20% distErrPct) and CompositeSpatialStrategy wrapping both. The number of terms for any shape is effectively capped and controlled indirectly via distErrPct – perhaps 100 terms for distErrPct=0.2? (not sure without trying).
[Legacy Jira: David Smiley (@dsmiley) on Jul 13 2018]
Thanks @jpountz, @mikemccand, and @dsmiley some great feedback here:
Great suggestions @jpountz.
I'll make these changes, clean up the code, and push to sandbox for iterating.
@dsmiley
There are also some off the shelf computational geometry libraries that can simplify polygons (JTS can), and some users may want to do that when indexing shapes.
I took a look at several of the OTS simplification a while ago. I'm sure there are many more out there than what I canvassed so this isn't intended to be a catch all comment. And we could certainly look at other tessellation libraries in the future for iterative performance improvements. But I did look at the JTS simplification tools. The problem with those varied: from tolerance value errors creating invalid polygons to general performance overhead. Each were great as a stand alone utility, but none were reliable nor performant enough for the scale desired.
they ought to use SerializedDVStrategy for accuracy combined with RecursivePrefixTreeStrategy for a grid index
Indeed this is an optimization allowing one to index "larger" quad cells (to reduce the number of terms in the inverted index) and subsequently use the WKB BinaryDocValues
for accurate relations. But not only is the use of BinaryDocValues with WKB limiting (with respect to shapes with large number of vertices and the marshalling/unmarshalling overhead) but all of the calls to .relate
(with either JTS, S2, or other third party implementations) incur their own performance penalties.. not just for relating each quad cell to the query shape but also with having to unmarshall "leaf shapes" and compute the DE9IM to relate those to the query shape for accuracy.
combined with...and...wrapping both
This was the other motivator... Like LatLonPoint
its nice to have a simple API (hopefully intended for core) that can solve the general Shape indexing and search use case without having to decide which PrefixTree, Grid, and Shape Relation libraries to use. Then for the more expert use cases users can always move over to spatial-extras
for all of the additional choices.
[Legacy Jira: Nick Knize (@nknize) on Jul 13 2018]
I definitely understand that this new shape indexing scheme is better than serialized geometry + a grid, and why it is – no need to sell me :). The benchmarks will show an amazing improvement, I'm sure – especially for shapes with a high number of vertexes.
I want to list some use-cases in which spatial-extras is (still) useful. This list might go into documentation for spatial-extras. A new LatLonShape here reduced the list somewhat – certainly for a common case (albeit indexing anything other than points simply isn't common among search/Lucene users). Perhaps in time LatLonShape will mature in capability enough for us to remove spatial-extras if there's not much worth saving for the complexity. It'd be helpful if you could validate your understanding against mine below.
[Legacy Jira: David Smiley (@dsmiley) on Jul 13 2018]
I took a peek at the patch. One bit surprised me. There is an optimization with a comment as follows:
// If all docs have exactly one value and the cost is greater
// than half the leaf size then maybe we can make things faster
// by computing the set of documents that do NOT match the query
But isn't that impossible to detect when the indexed data is comprised of multiple triangles per document? It uses PointValues.size() to detect this but that value isn't useful here, right? I'm guessing you copy-pasted this logic for LatLonPoint code where it does apply.
[Legacy Jira: David Smiley (@dsmiley) on Jul 13 2018]
Updated patch incorporating feedback from first round review.
[Legacy Jira: Nick Knize (@nknize) on Jul 13 2018]
But isn't that impossible to detect
Extremely unlikely to next to impossible? Yes, I think so. So I'm pretty confident this logic can be removed.
[Legacy Jira: Nick Knize (@nknize) on Jul 13 2018]
Commit b5ef13330f179ac6a2d98b9dfa4803811190e773 in lucene-solr's branch refs/heads/master from @nknize https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b5ef133
LUCENE-8396: Add Points Based Shape Indexing and Search that decomposes shapes into a triangular mesh and indexes individual triangles as a 6 dimension point
[Legacy Jira: ASF subversion and git services on Jul 14 2018]
Commit 3b1714e737049ec562c8dce66fc5fe8cdad74097 in lucene-solr's branch refs/heads/master from @nknize https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3b1714e
LUCENE-8396: silence random large poly test - for now
[Legacy Jira: ASF subversion and git services on Jul 14 2018]
Commit bc29580378d78f707b5cf26da8321980e907573c in lucene-solr's branch refs/heads/branch_7x from @nknize https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bc29580
LUCENE-8396: Add Points Based Shape Indexing and Search that decomposes shapes into a triangular mesh and indexes individual triangles as a 6 dimension point
[Legacy Jira: ASF subversion and git services on Jul 14 2018]
❌ -1 overall |
---|
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
-1 | patch | 0m 5s | LUCENE-8396 does not apply to master. Rebase required? Wrong Branch? See https://wiki.apache.org/lucene-java/HowToContribute#Contributing_your_work for help. |
Subsystem | Report/Notes |
---|---|
JIRA Issue | LUCENE-8396 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931617/LUCENE-8396.patch |
Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/47/console |
Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |
This message was automatically generated.
[Legacy Jira: Lucene/Solr QA on Jul 14 2018]
Test Failure....
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestLatLonShapeQueries -Dtests.method=testRandomMedium -Dtests.seed=70F032E1B960036E -Dtests.multiplier=3 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=th -Dtests.timezone=US/Eastern -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] FAILURE 24.3s J1 | TestLatLonShapeQueries.testRandomMedium <<<
[junit4] > Throwable #1: java.lang.AssertionError: wrong hit (first of possibly more):
[junit4] > FAIL: id=7542 should match but did not
[junit4] > query=LatLonShapeBoundingBoxQuery: field=shape:Rectangle(lat=0.0 TO 0.0 lon=8.381903171539307E-8 TO 90.07557132281363) docID=9514
[junit4] > polygon=[-4.190951585769653E-8, 0.0] [89.99999995809048, 0.0] [89.99999995809048, 179.99999991618097] [-4.190951585769653E-8, 0.0]
[junit4] > deleted?=false rect=Rectangle(0.0 TO 0.0 lon=8.381903171539307E-8 TO 90.07557132281363)
[junit4] > at __randomizedtesting.SeedInfo.seed([70F032E1B960036E:CD2E0549F8056008]:0)
[junit4] > at org.apache.lucene.document.TestLatLonShapeQueries.verifyRandomBBoxes(TestLatLonShapeQueries.java:263)
[junit4] > at org.apache.lucene.document.TestLatLonShapeQueries.verify(TestLatLonShapeQueries.java:134)
[junit4] > at org.apache.lucene.document.TestLatLonShapeQueries.doTestRandom(TestLatLonShapeQueries.java:130)
[junit4] > at org.apache.lucene.document.TestLatLonShapeQueries.testRandomMedium(TestLatLonShapeQueries.java:102)
[junit4] > at java.lang.Thread.run(Thread.java:748)
[junit4] 2> NOTE: leaving temporary files on disk at: /home/jenkins/workspace/Lucene-Solr-BadApples-master-Linux/lucene/build/sandbox/
[Legacy Jira: Nick Knize (@nknize) on Jul 14 2018]
Test failure, from https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Windows/726/:
Checking out Revision a9f129190f9065c8775a628df181fb53248db488 (refs/remotes/origin/branch_7x)
[...]
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestLatLonShapeQueries -Dtests.method=testRandomTiny -Dtests.seed=1BAECFBE374F07D6 -Dtests.slow=true -Dtests.locale=en-PN -Dtests.timezone=America/Edmonton -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] FAILURE 0.03s J1 | TestLatLonShapeQueries.testRandomTiny <<<
[junit4] > Throwable #1: java.lang.AssertionError: wrong hit (first of possibly more):
[junit4] > FAIL: id=11 should not match but did
[junit4] > query=LatLonShapeBoundingBoxQuery: field=shape:Rectangle(lat=-52.35433959402144 TO 58.83083060849458 lon=-79.63909948244691 TO 179.99999991618097) docID=11
[junit4] > polygon=[-4.190951585769653E-8, -152.01172342523932] [89.99999995809048, -152.01172342523932] [89.99999995809048, -49.80329998768866] [-4.190951585769653E-8, -152.01172342523932]
[junit4] > deleted?=false rect=Rectangle(-52.35433959402144 TO 58.83083060849458 lon=-79.63909948244691 TO 179.99999991618097)
[junit4] > at __randomizedtesting.SeedInfo.seed([1BAECFBE374F07D6:52E911F8696E3F7A]:0)
[junit4] > at org.apache.lucene.document.TestLatLonShapeQueries.verifyRandomBBoxes(TestLatLonShapeQueries.java:262)
[junit4] > at org.apache.lucene.document.TestLatLonShapeQueries.verify(TestLatLonShapeQueries.java:133)
[junit4] > at org.apache.lucene.document.TestLatLonShapeQueries.doTestRandom(TestLatLonShapeQueries.java:129)
[junit4] > at org.apache.lucene.document.TestLatLonShapeQueries.testRandomTiny(TestLatLonShapeQueries.java:97)
[junit4] > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit4] > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[junit4] > at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[junit4] > at java.base/java.lang.reflect.Method.invoke(Method.java:564)
[junit4] > at java.base/java.lang.Thread.run(Thread.java:844)
[...]
[junit4] 2> NOTE: test params are: codec=Asserting(Lucene70), sim=RandomSimilarity(queryNorm=true): {}, locale=en-PN, timezone=America/Edmonton
[Legacy Jira: Steven Rowe on Aug 01 2018]
I've been tinkering with this for a while and would like to solicit some feedback. I'd like to introduce a new shape field based on the BKD/Points codec to bring much of the Points based performance improvements to the shape indexing and search usecase. Much like the existing shape indexing in
spatial-extras
the shape will be decomposed into smaller parts, but instead of decomposing into quad cells (which have the drawback of precision accuracy and sheer volume of terms) I'd like to explore decomposing the shapes into a triangular mesh; similar to gaming and computer graphics. Not only does this approach reduce the number of terms, but it has the added benefit of better accuracy (precision is based on the index encoding technique instead of the spatial resolution of the quad cell).For better clarity, consider the following illustrations (of a polygon in a 1 degree x 1 degree spatial area). The first is using the quad tree technique applied in the existing inverted index. The second is using a triangular mesh decomposition as used by popular OpenGL and javascript rendering systems (such as those used by mapbox).
Decomposing this shape using a quad tree results in 1,105,889 quad terms at 3 meter spatial resolution.
Decomposing using a triangular mesh results in 8 triangles at the same resolution as
encodeLat/Lon
.The decomposed triangles can then be encoded as a 6 dimensional POINT and queries are implemented using the computed relations against these triangles (similar to how its done with the inverted index today).
Legacy Jira details
LUCENE-8396 by Nick Knize (@nknize) on Jul 12 2018, resolved Nov 01 2018 Attachments: LUCENE-8396.patch (versions: 2), polyWHole.png, tessellatedPoly.png