Open zhouchanghai opened 3 years ago
Do you have a dataset where this causes issues? I have wondered about this as well. What you are saying is true, but the distance is only used to find the nearest neighbours. I don't have data on how many points this mislabels, but the labels are approximate anyway (each district is described by only a single point), so the slowdown from a more complex metric might make switching not worth the effort...
You could try replacing it with sklearn.neighbors.KDTree(..., metric="haversine").
EDIT: oh, it seems haversine is not supported by KDTree:
>>> KDTree.valid_metrics
['euclidean', 'l2', 'minkowski', 'p', 'manhattan', 'cityblock', 'l1', 'chebyshev', 'infinity']
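For what it's worth, sklearn.neighbors.BallTree does accept metric="haversine", so it may be a possible substitute. Below is a minimal sketch, not the project's actual code: the candidate points and the query are made up for illustration, and the Earth radius of 6371 km is an assumption. The haversine metric expects (latitude, longitude) in radians and returns distances in radians on the unit sphere.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

# Hypothetical candidate points as (latitude, longitude) in degrees.
candidates_deg = np.array([
    [81.0, 0.0],  # one degree north of the query
    [80.0, 3.0],  # three degrees east of the query
])
query_deg = np.array([[80.0, 0.0]])

# Euclidean nearest neighbour on raw degrees (the KDTree approach).
kd_tree = KDTree(candidates_deg, metric="euclidean")
_, kd_idx = kd_tree.query(query_deg, k=1)
euclid_nearest = int(kd_idx[0, 0])

# Great-circle nearest neighbour; inputs must be in radians.
ball_tree = BallTree(np.radians(candidates_deg), metric="haversine")
dist_rad, ball_idx = ball_tree.query(np.radians(query_deg), k=1)
haversine_nearest = int(ball_idx[0, 0])
haversine_km = float(dist_rad[0, 0]) * EARTH_RADIUS_KM

print("euclidean picks index", euclid_nearest)
print("haversine picks index", haversine_nearest, "at", round(haversine_km, 1), "km")
```

In this made-up case the two metrics disagree: the Euclidean tree prefers the point one degree to the north, while the haversine tree prefers the point three degrees to the east, because a degree of longitude is short at 80N. Whether this happens often enough in real data to matter is still an open question.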
For example, the real distance between (0N, 0E) and (0N, 1E) is about 111 km, while the real distance between (80N, 0E) and (80N, 1E) is about 19 km (https://www.movable-type.co.uk/scripts/latlong.html). But the two pairs are the same distance apart under the Minkowski metric. BTW, latitude ranges from -90 to 90 but longitude ranges from -180 to 180, so the two axes are on different scales.
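The comparison above can be checked with a small haversine sketch using only the standard library. The function name haversine_km and the Earth radius of 6371 km are assumptions made for this illustration; they are not taken from the project.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude at the equator and at 80N.
equator_km = haversine_km(0, 0, 0, 1)
lat80_km = haversine_km(80, 0, 80, 1)

# Euclidean distance on raw degrees is identical for both pairs.
equator_deg = math.hypot(0 - 0, 1 - 0)
lat80_deg = math.hypot(80 - 80, 1 - 0)

print(round(equator_km, 1), round(lat80_km, 1), equator_deg, lat80_deg)
```

Under these assumptions the great-circle distances come out close to the figures quoted above, while the degree-space distance is 1.0 for both pairs.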