google / s2geometry

Computational geometry and spatial indexing on the sphere
http://s2geometry.io/
Apache License 2.0
2.33k stars 309 forks source link

intersects or not slow #288

Open MBkkt opened 1 year ago

MBkkt commented 1 year ago

I think main issue this todo https://github.com/google/s2geometry/blob/master/src/s2/s2boolean_operation.cc#L2255

Will it be done sometime?

jmr commented 1 year ago

Do you have test data that triggers it?

I'll try do to a new release soon and add the microbenchmarks. This will make it easier to reproduce/profile in isolation.

MBkkt commented 1 year ago

Not, it's customer data but I can describe it. https://github.com/google/s2geometry/discussions/290

X polygons is rectangles, and Y is any S2Loop

But I think it's concrete case in general I want to find some more effective way.

I found CrossingProcessor created every time, if exist way to create it single time and when clear it became better. And same for vectors with Edges starts_a, and starts_b.

I think it can be class which inherit from S2BooleanOperation. Something like S2CachedBooleanOperation which class can contains all this temporary fields and have Clear method.

In such case we at least can avoid allocate/free many times

MBkkt commented 1 year ago

@jmr

What do you think about add something like

Intersects(S2Polygon& lhs, S2Polygon& rhs) {
  if (lhs.num_loops() == 1 && rhs.num_loops() == 1) {
    return lhs.loop(0).Intersects(rhs.loop(0));
  }
  current expensive code

Same possible for Contains(Polygon&)

I think it's pretty common case when polygon contains only single loop? (Any single polygon without holes) Even separate ctor exists for this case.

jmr commented 1 year ago

It seems like it would be better if S2BooleanOperation could do it quickly. Then more code would benefit. I don't know about the feasibility of this. @smcallis

Not, it's customer data but I can describe it.

Can you make a similar, synthetic test case not derived from the customer data?

smcallis commented 1 year ago

+1 to S2BooleanOperation doing this faster, I believe we have it on the roadmap but I'm not sure when we'll get to it.

MBkkt commented 1 year ago

Sounds good if it will be faster.

I think I can provide some benchmarks later.

In general I tried to speedup arangodb searching geo data with inverted index (of course firstly I fixed not related to S2 issue).

And know I have two issues: 1) Slow S2BooleanOperation 2) I don't really sure what format on disk I should use, now It's just GeoJson, which we parse to S2 structure, it's not precise (store lat lng in degrees, not S2Point) and slow (needs allocation, a lot of conditions).

So I also have question about 2, will be nice if you have some advice.

The main pros of storing geo json it's more compact, especially for points (needs only 8*2 + 1 bytes). I have question: Is lat lng never used to storing S2Region/Shape because it has a precision lost? Or it's because you consider trigonometry translation degrees latlng to S2Point is more expensive?

Also now I implemented serialization with s2 Decoder/Encoder and S2Region (we use it for precise comparison, after finding terms from S2TermIndexer)

And I not sure, I see a lot of lazy classes which named like S2LaxPolygonShape, is it better to use?

But how to intersect/etc they?

smcallis commented 1 year ago

I'm working on new bulk queries, probably the intersection query will be what you want, it'll be substantially faster than S2BooleanOperation which effectively computes the intersection then tells you if it's empty or not. I'll be writing that here in Q1, I've got all the pieces written and just need to integrate and test.

As for formats, I'd recommend storing data directly in the encoded format. Build an S2Polygon (or S2LaxPolygonShape) and store that binary blob directly. We can decode those very quickly and lazily if need be.

I don't see how GeoJson can be smaller since it's text based, unless you're throwing away almost all your precision and storing just two decimal points. If you want to do that, better to "snap" your geometry to a given S2Cell level. Two digits of precision is ~a level 9 cell (source), so you could snap to level 9 and use COMPACT encoding and the encoder can use the fact that the points are snapped to reduce the size (they can be stored as integers, level 9 would have 3+18+1=22 bits/point).

S2LaxPolygon[Polyline]Shape are lax not lazy. They allow degeneracies such as degenerate edges (v0,v0), and sibling pairs (two reversed edges touching). They're meant to be more flexible and ultimately replace S2Polygon/S2Polyline, and they still work with things like S2BooleanOperation.

MBkkt commented 1 year ago

@smcallis

Thanks for detailed answer.

I don't see how GeoJson can be smaller since it's text based

We store GeoJson in binary representation, something like messagepack/bson/etc if you familiar (so it without some text overhead, like spaces, comma, etc)

In average we need 16 byte per point (because GeoJson describe points like lat lon), and few additional bytes per array to store then it starts/ends + constant like 40 bytes.

S2Polygon needs: 424 bytes bound 3 bytes own loops/has hole/format version markers 4 byte loops count Every loop: 1 byte version marker 4 bytes number of vertices 24 bytes per vertex 1 byte origin state 4 byte depth ? 4 24 bytes for bound

So it's extremely bad for small polygons

S2LaxPolygon Encode looks a lot of better It's just all vertices + marks of loops start

So it better then lat, lon encoding GeoJson before 15~20 vertices, then it also worse but it's understandable

Of course with CodingHint Compact it can be better for non convex polygons, but only a little

In general I think we should use S2LaxPolygon instead of S2Polygon, it looks a lot of better, but translation is not easy

MBkkt commented 1 year ago

S2LaxPolygon[Polyline]Shape are lax not lazy

I meant they have EncodedS2Lax* interface to decode them lazily

MBkkt commented 1 year ago

and they still work with things like S2BooleanOperation.

Yes, but they need to compute index.

For an example S2Polygon::Contains(Point) not always need to do it.

I think my main issue: I have operations like contains/intersects/etc

I want to compute exact yes or no.

And one operand from user query, for it I can do heavy precompute, and it's size doesn't really matter.

Another operand from index, in general from disk, and I already know it probably intersects (because I found it terms (terms from coverage)) and for index data I want it to be compact, and fast to extract and do operation with user data.

Simplified:

auto data1 = something heavy and precompute for (data2 in index where data2 terms intersects with data1 terms) { If (fast_intersects/contains/etc(data1, data2)) { push(data2); }

}

smcallis commented 1 year ago

In general I think we should use S2LaxPolygon instead of S2Polygon, it looks a lot of better, but translation is not easy

Yeah the Lax classes are implementations of the new Shape API which is more flexible in general, S2Polygon is really legacy at this point, I'd encourage you to use Shapes whenever possible.

I meant they have EncodedS2Lax* interface to decode them lazily

Ah yes they do, we have EncodedS2ShapeIndex as well so you can decode the index and the shapes in it lazily as well to minimize disk seeks. You can decode them all at once as well though.

For an example S2Polygon::Contains(Point) not always need to do it.

We have s2shapeutil::ContainsBruteForce that can check point containment without an index. Intersection and containment are the same for a polygon/point but merely checking for vertex containment and lack of edge crossings isn't sufficient for polygons because they have interior that can escape, so you have to verify orientation as well, which makes it a fair bit harder.

MBkkt commented 1 year ago

s2shapeutil::ContainsBruteForce

Thanks!