mikolalysenko / ama

Ask me anything!
8 stars 0 forks source link

What's the best 2d point index for fixed-radius queries? #15

Open mourner opened 8 years ago

mourner commented 8 years ago

@mikolalysenko Hi Mikola, cross-posting this question from gisdevs algorithms room to make sure you see it.

On your static-kdtree repo, you state that for fixed radius search, grids are a strictly better option for indexing 2d points. However I can’t seem to come up with a way to implement them efficiently in JS if the radius is low compared to the data space. E.g. if we have points spread out over the whole world but the query radius we need is 50 meters, the number of grid cells would be enormous and most cells would be empty, leading to a huge memory waste, not talking about the inefficiency of V8 sparse arrays or big dictionaries.

What do you think is the best way to index points for fixed small radius queries? Specifically, I'm interested in this for density-based clustering, which is the case I'm exploring in supercluster, but there are many more application where this kind of index is useful in geospatial.

RBush with it’s top-down packing generally works well for point data, but I wonder if there are more efficient data structures that would be fast to construct and fast to query with low memory footprint specifically for the 2d use case.

mikolalysenko commented 8 years ago

Don't store the empty cells. Instead you should store just a list of points sorted in lexicographic order by their bin index. In this way you can locate a point in the grid by binary search in O(log(n)) time.

Alternatively, in 2D you can just use a Delaunay triangulation + point location data structure, which gets you the same complexity and has the benefit that it can answer pretty much any other kind of query you would like (eg halfspace, variable sized disk, closest point etc.).

mikolalysenko commented 8 years ago

Here is an example of using this technique to resolve collisions between particles:

https://github.com/mikolalysenko/n-body-pairs

mourner commented 8 years ago

Wow, I feel so dumb not thinking about sorting and binary search. Sounds promising. For a radius-based query, I would do 9 binary searches (current + adjacent cells), and each binary search would be a bit more complicated since it needs to return multiple values (e.g. 2 searches, one for lower bound and one for upper bound), right? I'll try a quick proof of concept tomorrow.

Regarding Delaunay triangulation, isn't it computationally much more expensive than something like a radix sort?

mikolalysenko commented 8 years ago

For preprocessing, it is marginally slower. Asymptotically building a 2D Delaunay triangulation is about the same as the cost of sorting a set of points (in fact you can use a radix sort to do a Delaunay triangulation for floating point inputs).

Once it is built though, querying the Delaunay triangulation is faster.

mikolalysenko commented 8 years ago

Also you can speed up the 9 binary searches using cascading. This is related to the problem of intersecting two sorted lists. What you do is you start by binary searching on the upper/lower bounds of the 9 query bin indices, then you recurse down the list in batches. If you encounter a split point, then you separate the points and recursively search with the two sub arrays.

mourner commented 8 years ago

Right, I think I've seen a similar technique for range search optimization on z-order curves.

About Delaunay, would storage requirement be a problem though? You would need 3*N indices array as opposed to no additional storage when you sort in place.

mikolalysenko commented 8 years ago

You can store the data for a Delaunay triangulation compactly using a variety of topological data structures, but 3N integers = 12 \ N bytes isn't that much data.

mikolalysenko commented 8 years ago

Also DT can answer more queries than a sorted list. You can do variable sized radius, nearest neighbors, half spaces, etc. with one unified data structure.

mourner commented 8 years ago

Interesting! Where you referring to this algorithm for sort-like performance of DT? http://www.s-hull.org/paper/s_hull.pdf What's your fastest DT implementation for 2d, incremental-delaunay? How does the performance compare to native sort?

mikolalysenko commented 8 years ago

No, incremental-delaunay is actually pretty bad and I would not recommend using it. CDT2D should be a faster, though in both cases I haven't optimized them yet. In principle CDT2D can be made to run O(sort(n)) time.

mikolalysenko commented 8 years ago

Also this is the theoretical result I was referring to: http://www.cs.princeton.edu/~wmulzer/pubs/wramdtFOCS.pdf

There are also some other ways to build planar DTs in sort(n). Another reduction is possible via compressed quadtrees: http://www.tcs.fudan.edu.cn/rudolf/Courses/Seminar/Seminar_11s/SODA11/loeffler_quadtree_soda11.pdf