tidwall / tile38

Real-time Geospatial and Geofencing
https://tile38.com
MIT License
9.15k stars 570 forks source link

Benchmark between [ standard overlap+Haversine algorithm] and [KNN algorithm] #644

Open devShahriar opened 2 years ago

devShahriar commented 2 years ago

Hello. For the nearby lookup, tile38 uses the KNN algorithm instead of [standard overlap+Haversine algorithm].

Is there any benchmark or comparison done between these two approaches?

I would like to know about the benefits of using the KNN algorithm for radius lookup.

tidwall commented 2 years ago

Hi,

The kNN algo will likely be a little slower than overlap+haversine.

This is mainly because the kNN requires the use of a priority queue data structure that tracks the distance to the branch rectangles, recursively. Those distances are performed using a geodesic formula that is a little more processor intensive.

The overlap+haversine, on the other hand, is a much simpler algo because it only requires generating a bounding rectangle of the radius circle, then querying the r-tree for all points contained in that rect, finally using haversine to determine if the point is within the target distance.

The NEARBY command always uses kNN, for overlap+haversine you can use WITHIN+CIRCLE.

For example, these two commands will return the same data.

NEARBY fleet POINT 33 -112 5000
WITHIN fleet CIRCLE 33 -112 5000

Is there any benchmark or comparison done between these two approaches?

I don't have a built in benchmark. Perhaps the redis-benchmark tool can help.

redis-benchmark -p 9851 -q NEARBY fleet POINT 33 -112 5000
redis-benchmark -p 9851 -q WITHIN fleet CIRCLE 33 -112 5000

I would like to know about the benefits of using the KNN algorithm for radius lookup.

For NEARBY+POINT the results are always ordered by distance, nearest to farthest. For WITHIN+CIRCLE the order of the results is undefined.

The power comes when using NEARBY with a LIMIT. Since that is what the "k" in kNN is for.

For example, let's say you need to find the nearest 1,000 in a 100 km radius. But your collection may have millions of points that are in that same area.

You can run:

NEARBY fleet LIMIT 1000 POINT 33 -112 100000

Which efficiently queries the collection and returns back the first 1,000 points nearest to 33,-113, but will not extend past the 100 km. The kNN algo knows exactly when to stop and avoid further computation, and all results will be in perfect order.

Or if you don't care about the bounding radius, you can forego the 100 km and just do:

NEARBY fleet LIMIT 1000 POINT 33 -112

While this:

WITHIN fleet LIMIT 1000 CIRCLE 33 -112 100000

Will simply search for the first 1000 points and stop. The order of the results are determined by whatever the underlying r-tree layout returns.

No doubt WITHIN will be faster than NEARBY.

So if you are not using a LIMIT or you don't care about the order of the results, then use WITHIN+CIRCLE, otherwise use NEARBY.

devShahriar commented 2 years ago

Thanks for the explanation.

For my use case . I want to stop at a certain radius .

For example I have location ping around 300 km radius. But I want to get all data point within 2km radius for certain lat/lon and I dont want to do any further computation after 2km in that case KNN is faster than haversine right ?

Suppose in elastic search if you have a radius query for 2km elastic will calculate the haversine for all data points and return them in sorted order and than it returns data point with 2km threshold. I dont want my system to compute that extra haversine for all data points after 2km radius .

In the case above knn algo will perform faster than haversine and ordering right ?

tidwall commented 2 years ago

In the case above knn algo will perform faster than legacy haversine and ordering right ?

I would think the other way around. The kNN uses a more complex geodesic algorithm that is slower than haversine, but is accurate for branch rectangles on a sphere (which is needed because haversine only works with points).

I don't really consider haversine legacy. The formula is used in WITHIN+CIRCLE and is very fast, it can handle many millions of operations per second.

I recommend experimenting with the two options and seeing which one works best for you.

devShahriar commented 2 years ago

Thanks again.