mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration
0 stars 0 forks source link

LatLonShape: Query with the same polygon that is indexed might not match [LUCENE-8634] #633

Open mikemccand opened 5 years ago

mikemccand commented 5 years ago

If a polygon with a degenerated dimension is indexed and then an intersect query is performed with the same polygon, it might result in an empty result. For example this polygon with degenerated longitude:

POLYGON((1.401298464324817E-45 22.0, 1.401298464324817E-45 69.0, 4.8202184588118395E-40 69.0, 4.8202184588118395E-40 22.0, 1.401298464324817E-45 22.0))


Legacy Jira details

LUCENE-8634 by Ignacio Vera (@iverase) on Jan 11 2019, updated Jan 25 2019 Attachments: LUCENE-8634.patch (versions: 2)

mikemccand commented 5 years ago

Attached a test reproducing the issue.

[Legacy Jira: Ignacio Vera (@iverase) on Jan 11 2019]

mikemccand commented 5 years ago

I had a closer look and it seems it happens the same for lines. I attached a possible fix which encode/decode the points for the given polygon/line queries. 

Bounding box queries already work on the encoding space so they are not affected.

[Legacy Jira: Ignacio Vera (@iverase) on Jan 11 2019]

mikemccand commented 5 years ago
+1 overall
Vote Subsystem Runtime Comment
Prechecks
+1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
master Compile Tests
+1 compile 1m 7s master passed
Patch Compile Tests
+1 compile 0m 16s the patch passed
+1 javac 0m 16s the patch passed
+1 Release audit (RAT) 0m 16s the patch passed
+1 Check forbidden APIs 0m 16s the patch passed
+1 Validate source patterns 0m 16s the patch passed
Other Tests
+1 unit 10m 41s sandbox in the patch passed.
16m 34s
Subsystem Report/Notes
JIRA Issue LUCENE-8634
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12954596/LUCENE-8634.patch
Optional Tests compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool ant
Personality /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision master / dcc9ffe
ant version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018
Default Java 1.8.0_191
Test Results https://builds.apache.org/job/PreCommit-LUCENE-Build/150/testReport/
modules C: lucene/sandbox U: lucene/sandbox
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/150/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

[Legacy Jira: Lucene/Solr QA on Jan 11 2019]

mikemccand commented 5 years ago

Thanks Ignacio. I understand how it makes things better but then I have two questions:

[Legacy Jira: Adrien Grand (@jpountz) on Jan 16 2019]

mikemccand commented 5 years ago

Are there concerns regarding the fact that a polygon might become invalid through encoding (and eg. cross itself).

The encoding is very fine grid with cells of size 8.39e-8 degrees X 8.39e-8 degrees. So when moving to the encoding space it can happen that two points become the same point or if close enough to an edge in theory it can cross it. This is the same case as in the tessellator and never happen during the test. I check and Polygon2D can handle when having shared vertex and could not really reproduce the other case, so I am not too worry about this.

Should we keep lines and polygons in the encoded space like boxes?

I have been wondering that all the time (It might help performance) but it is a bigger change. Sure if we move query shapes to the encoding space, then we should probably keep them there.

 

[Legacy Jira: Ignacio Vera (@iverase) on Jan 16 2019]

mikemccand commented 5 years ago

This is the same case as in the tessellator

The tessellator is a bit different because it tessellates into triangles before moving to the encoded space? The only downside is that triangles might change orientation through encoding, which is not a problem now because the encoding doesn't make any assumption about the encoding of triangles?

[Legacy Jira: Adrien Grand (@jpountz) on Jan 16 2019]

mikemccand commented 5 years ago

Mostly, the Morton optimisation order the points on the encoding space, but not sure how relevant it is. Anyway, this should only be relevant if you are defining your polygons with centimetre accuracy which it is probably not realistic.

[Legacy Jira: Ignacio Vera (@iverase) on Jan 16 2019]

mikemccand commented 5 years ago

I think it depends what happens in such a case: would we fail the query with an exception or would we only fail to match where edges cross or are too close to each other? The latter sounds acceptable to me due to the documented accuracy limitations, but failing a query with a valid polygon less so?

[Legacy Jira: Adrien Grand (@jpountz) on Jan 16 2019]

mikemccand commented 5 years ago

It won't fail as there are no checks for intersection there. So yes, the side effect would be lack of accuracy on spatial relationships.

[Legacy Jira: Ignacio Vera (@iverase) on Jan 16 2019]

mikemccand commented 5 years ago

Actually, in the example of the patch, the resulting polygon is invalid (it becomes a line). 

[Legacy Jira: Ignacio Vera (@iverase) on Jan 16 2019]

mikemccand commented 5 years ago

For clarity I'll separate the issues being discussed here:

  1. Sub centimeter polygons: I think we document the spatial resolution of the encoding fairly well. 1e-7 dec deg ~= 1.11cm. Any polygon defined with vertex distances <= 1e-7 dec deg (like the one in the example here) should not be expected to index with the same accuracy as provided. So subcentimeter polygons in the WGS84 lat/lon projection are not supported and may result in an unexpected invalid shape. This is why I opened LUCENE-8632 to lay groundwork for alternative projections. If a user wants to index subcentimeter shapes they should do so using the right spatial reference system for the job.

  2. "Should we keep lines and polygons in the encoded space like boxes?"

So I made a simple decision (occam's razor) when handling this in the first iteration of development. For point, line, and polygon query, rather than quantize the search shape in the query constructor (like BoundingBox does) I quantized the query shape in the test before invoking the query. Right or wrong I chose this route for two reasons: a. consistency with LatLonPointInPolygonQuery which we discussed this topic at length across several issues, and b. we have no formal support for the EqualTo relation operation, only INTERSECT, DISJOINT, WITHIN. In hindsight INTERSECT does fill this void so a false negative using INTERSECT query on an indexed shape that is equalTo the query shape could/should probably be considered a bug. Furthermore, LatLonPointInPolygonQuery doesn't have the complexities to contend with like relating lines and polygons. I think this change probably deserves a bit more thought / consideration.

[Legacy Jira: Nick Knize (@nknize) on Jan 25 2019]