mapado / haversine

Calculate the distance between 2 points on Earth
MIT License
326 stars 62 forks source link

haversine_vector is no longer fast #63

Closed jobh closed 1 year ago

jobh commented 1 year ago

Following commit b949e32246976f14f192c97c9bb0b82b8e1ee451, which introduces element-wise limits check, the vectorized version is 10 times or more slower than it used to be.

There should probably be vectorized versions of the check, or a flag to indicate that limits check is not needed.

jdeniau commented 1 year ago

Hi @jobh .

It might come from the else case here does not use numpy.array

https://github.com/mapado/haversine/commit/b949e32246976f14f192c97c9bb0b82b8e1ee451#diff-aac46069c41f551819527065ee5717c7767fa1a3ec7fb2b4cde307d1a01e78c3R168-R173

How many data points do you have though ? Because this method is a O(n) complexity, which should not be that huge (even if we can do better).

Can you provide a reproductible case ?

Thanks

jobh commented 1 year ago

Sure!

$ pip install --upgrade haversine==2.5.1 && \
    python -m timeit \
    -s 'import numpy, haversine; a=numpy.zeros((1000,2))' \
    'haversine.haversine_vector(a, a)'
[...]
Successfully installed haversine-2.5.1
10000 loops, best of 5: 26.3 usec per loop

If I change to 2.7.0, I get

[...]
Successfully installed haversine-2.7.0
500 loops, best of 5: 474 usec per loop

My typical input size is 100-1000 locations, where I want distance between neighbours: haversine_vector(a[1:], a[:-1]).

In the code, both branches are slow since they do a function call per location. Vectorized versions might be something like the following (untested) which can be called single-shot instead of per-location. OTOH, a separate flag (...check=True) which can be turned off may be an option, for those who know the data is good and need speed.

def _ensure_lat_lon_vector(lats, lons):
    if numpy.any(numpy.abs(lats) > 90):
        raise ValueError(f"Latitude is out of range [-90, 90]")
    if numpy.any(numpy.abs(lons) > 180):
        raise ValueError(f"Longitude is out of range [-180, 180]")

and

def _normalize_vector(lats, lons):
    lats = (lats + 90) % 360 - 90
    idx = lats > 90
    lats[idx] = 180 - lats[idx]
    lons[idx] += 180 # is this right?
    lons = (lons + 180) % 360 - 180
    return lats, lons
jobh commented 1 year ago

...and (to explain my motivation): I do this for thousands of tracks, to filter out those with too large distance between neighbouring locations. So that half millisecond quickly adds up to several seconds.