mapbox / which-polygon

Index for matching points against a set of GeoJSON polygons
ISC License
133 stars 14 forks source link

Investigate faster data structures #1

Closed mourner closed 8 years ago

mourner commented 8 years ago

Rather than do naive ray casting + rbush, we should try other approaches for faster queries:

mikolalysenko commented 8 years ago

Regarding performance, it should be possible to speed up point-in-region a bit if robustness isn't an issue. I could maybe take a stab at this on the weekend.

mourner commented 8 years ago

@mikolalysenko would be great to see how it compares! In addition to non-robust predicates, you could also try non-immutable BST, e.g. a simple array-based one like this https://github.com/mourner/bbtree/blob/master/bsarray.js

mikolalysenko commented 8 years ago

@mourner the problem with that is that it would take O(n^2) space here. The reason an immutable BST is used is that it allows structural sharing when it is constructing by scanning. Otherwise you would just be doing a plain old slab decomposition.

mikolalysenko commented 8 years ago

With the functional BST you can do the queries in O(n log(n)) space

mourner commented 8 years ago

@mikolalysenko tried point-in-region module but it doesn't seem to work properly (e.g. [50.5, 30.5] results in null instead of Ukraine): https://github.com/mapbox/which-polygon/tree/point-in-region. Not sure if I'm doing it right.

mikolalysenko commented 8 years ago

Yeah, that module has a few bugs. point-in-big-polygon should work, but I really need to give this whole thing a closer look.

mourner commented 8 years ago

@mikolalysenko what do you think about PM Quadtrees by Hanan Samet — is that the same quadtree-based algo you described, just with some extentions? It's a pretty old idea (1985), though the info/code on it is very scarse and it's not used extensively in the wild (at least MSSQLSpatial seems to use it). It may be very fast for point in region queries.

mikolalysenko commented 8 years ago

@mourner it is a good data structure for many common situations, but the worst case performance is bad and it can fail in pathological situations (like if many lines intersect at the same point). The way that point-in-region solves the problem has guaranteed O(log(n)) query time, while PM-QuadTrees can degrade in some bad situations.

Still if your data is suitably nice it is something which is worth looking into.

mourner commented 8 years ago

@mikolalysenko OK, I see. So are you going to look into the point-in-region bugs? Since my use case is just user GeoJSON with polygons, it may contain polygons that intersect itself and each other (in some minor ways) and have degeneracies, and I'd still want to have a reliable point-in-poly tests when querying in non-overlapping/non-degenerate areas.

mikolalysenko commented 8 years ago

Ah, I see. Might be possible to fix that up using clean-pslg, though resolving all these intersections might take a long time.

I bet this is also what is breaking point location for you in that issue.

mikolalysenko commented 8 years ago

A question though: In the case of self intersections how would you like to handle location queries? Do you want to return a list of all regions containing the point?

mourner commented 8 years ago

Right, I see. There's no rush looking into this since the current naive approach works perfectly for my use cases already (it can query 50k points in 1ms with a 4MB world countries GeoJSON).

Regarding intersections and overlaps, even-odd rule like in ray casting works very well since self-intersections and hole/outer overlaps in user data are usually introduced by simplifying (i.e. with Douglas-Peucker) and are pretty minor, and I'm currently just returning the first matching polygon if there are multiple.

Given that, perhaps it's still worth looking into PM quadtrees since extremely bad data situations would not occur much in practice probably (for geospatial datasets like administrative boundaries), and it would still rely on ray-casting in the end, handling edge cases gracefully.

mikolalysenko commented 8 years ago

Yeah, might be fun to try something with PMR quadtrees. Still I would like to see how point-in-region stacks up, and also it has been in my back log a long time to go back and clean it up a bit.

mourner commented 8 years ago

A much simpler idea just from the top of my mind is to additionally index each polygon in a Y range tree so that point-in-poly checks have to do O(log n) comparisons in average instead of O(n).

mourner commented 8 years ago

It's good enough for now, so closing.