KDtree in scipy.spatial is using Minkowski distance which is not suitable for latitude/longitude

thampiman / reverse-geocoder

A fast, offline reverse geocoder in Python

GNU Lesser General Public License v2.1


KDtree in scipy.spatial is using Minkowski distance which is not suitable for latitude/longitude #59

Open zhouchanghai opened 3 years ago

zhouchanghai commented 3 years ago

For example, the real distance between (0N, 0E) and (0N, 1E) is about 111 km, while the real distance between (80N, 0E) and (80N, 1E) is about 19 km (https://www.movable-type.co.uk/scripts/latlong.html). In Minkowski distance on raw latitude/longitude, however, both pairs are exactly 1 degree apart. BTW, latitude ranges from -90 to 90 but longitude ranges from -180 to 180, so the two axes are on different scales.
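To make the numbers above concrete, here is a minimal sketch that compares the great-circle (haversine) distance with the Euclidean distance on raw degrees. The helper names `haversine_km` and `degree_distance` and the Earth radius of 6371 km are assumptions chosen for illustration; they are not part of reverse-geocoder.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def degree_distance(lat1, lon1, lat2, lon2):
    """Euclidean (Minkowski p=2) distance on raw latitude/longitude degrees."""
    return math.hypot(lat2 - lat1, lon2 - lon1)


equator_km = haversine_km(0, 0, 0, 1)    # one degree of longitude at the equator
arctic_km = haversine_km(80, 0, 80, 1)   # one degree of longitude at 80N

print(f"equator: {equator_km:.1f} km, 80N: {arctic_km:.1f} km")
print("degrees:", degree_distance(0, 0, 0, 1), degree_distance(80, 0, 80, 1))
```

Both pairs are 1.0 apart in degree space, while the great-circle distances differ by roughly a factor of cos(80°).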

BoZenKhaa commented 3 years ago

Do you have a dataset where this causes issues? I have wondered about this as well. What you are saying is true, but the distance is only used to find the nearest neighbours. I don't have data on how many mislabeled points this causes. Since the labels are approximate anyway (each district is described by only a single point), the slowdown from a more complex metric might not be worth the effort...
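As a synthetic illustration (not real geocoder data), the sketch below builds one query point and two hypothetical candidates at high latitude where the nearest neighbour in degree space differs from the nearest neighbour on the sphere. The coordinates and the names `query`, `candidates`, and `haversine_km` are made up for the example; how often this happens with the actual city dataset is not measured here.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km; radius_km is an assumed mean Earth radius."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Hypothetical points as (latitude, longitude) in degrees.
query = (80.0, 0.0)
candidates = {
    "A": (80.0, 3.0),  # 3 degrees east, same latitude
    "B": (82.0, 0.0),  # 2 degrees north, same longitude
}

# Nearest neighbour using Euclidean distance on raw degrees.
nearest_planar = min(
    candidates,
    key=lambda name: math.hypot(candidates[name][0] - query[0],
                                candidates[name][1] - query[1]),
)

# Nearest neighbour using great-circle distance.
nearest_sphere = min(
    candidates,
    key=lambda name: haversine_km(*query, *candidates[name]),
)

print("planar metric picks:", nearest_planar)
print("haversine picks:", nearest_sphere)
```

In degree space B is closer (2 versus 3 degrees), but on the sphere A is roughly 58 km away and B roughly 222 km away, so the two metrics disagree for this query.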

Dobatymo commented 1 year ago

You could try to replace it with sklearn.neighbors.KDTree(..., metric="haversine").

EDIT: Oh, it seems KDTree does not support haversine:

>>> from sklearn.neighbors import KDTree
>>> KDTree.valid_metrics
['euclidean', 'l2', 'minkowski', 'p', 'manhattan', 'cityblock', 'l1', 'chebyshev', 'infinity']
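One possible workaround that keeps a Euclidean KD-tree is to convert each latitude/longitude pair to a 3D point on the unit sphere first. The straight-line (chord) distance between two such points grows monotonically with the great-circle distance, so the Euclidean nearest neighbour in 3D is also the nearest neighbour on the sphere, and the antimeridian needs no special handling. The sketch below uses brute force and only the standard library; the helper names `to_unit_xyz`, `chord`, and `chord_to_km` are hypothetical, and feeding the converted points to a KD-tree such as scipy's is left as an assumption rather than shown. Alternatively, `sklearn.neighbors.BallTree` accepts `metric="haversine"` with coordinates in radians.

```python
import math


def to_unit_xyz(lat, lon):
    """Convert degrees latitude/longitude to a point on the unit sphere."""
    phi, lmb = math.radians(lat), math.radians(lon)
    return (math.cos(phi) * math.cos(lmb),
            math.cos(phi) * math.sin(lmb),
            math.sin(phi))


def chord(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def chord_to_km(chord_length, radius_km=6371.0):
    """Convert a unit-sphere chord length back to great-circle km."""
    return 2 * radius_km * math.asin(min(1.0, chord_length / 2))


# Same hypothetical points as a high-latitude test case.
query = to_unit_xyz(80.0, 0.0)
points = {"A": to_unit_xyz(80.0, 3.0), "B": to_unit_xyz(82.0, 0.0)}

# Plain Euclidean nearest neighbour, now in 3D instead of degree space.
nearest = min(points, key=lambda name: chord(query, points[name]))
nearest_km = chord_to_km(chord(query, points[nearest]))

# Two points 1 degree apart across the antimeridian.
wrap_km = chord_to_km(chord(to_unit_xyz(0.0, 179.5), to_unit_xyz(0.0, -179.5)))

print(nearest, round(nearest_km, 1), round(wrap_km, 1))
```

The trade-off is one extra dimension in the tree and a conversion step per query, in exchange for neighbour rankings that match great-circle distance.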