scikit-mobility / scikit-mobility

scikit-mobility: mobility analysis in Python
https://scikit-mobility.github.io/scikit-mobility/
BSD 3-Clause "New" or "Revised" License

preprocessing.clustering: include support for latitudes far from equator #195

Open vlingenfelter opened 3 years ago

vlingenfelter commented 3 years ago

Currently, the DBSCAN clustering relies on a constant that assumes the dataset is at the equator. I would like to see this extended so that it does not default to the equator, perhaps by using the average latitude of the dataset.

Constant used for calculations:

kms_per_radian = 6371.0088   # Caution: this is only true at the Equator!
                             # This may cause problems at high latitudes.

Used later in calculation:

# Compute DBSCAN
eps_rad = cluster_radius_km / kms_per_radian

db = DBSCAN(eps=eps_rad, min_samples=min_samples, algorithm='ball_tree', metric='haversine')

Where cluster_radius_km defaults to 0.1 but can be set by the user.
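
To make the conversion above concrete, here is a minimal sketch in plain Python (no scikit-learn) of the haversine central angle and the `eps_rad` threshold it is compared against. The helper name `haversine_rad` and the two sample coordinates are hypothetical; the helper only mirrors what a haversine metric computes on (lat, lon) pairs given in radians and is not taken from scikit-mobility.

```python
import math

KMS_PER_RADIAN = 6371.0088  # same constant as in the snippet above
cluster_radius_km = 0.1     # default mentioned in this issue

def haversine_rad(p, q):
    """Central angle in radians between two (lat, lon) points given in radians."""
    lat1, lon1 = p
    lat2, lon2 = q
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))

# DBSCAN's eps is expressed as an angle, so the km radius is divided by the constant.
eps_rad = cluster_radius_km / KMS_PER_RADIAN

# Two illustrative points 0.0005 degrees of latitude apart (north-south).
p = (math.radians(37.0), math.radians(-120.0))
q = (math.radians(37.0005), math.radians(-120.0))

angle = haversine_rad(p, q)
within_eps = angle <= eps_rad            # would these two points be neighbours?
distance_km = angle * KMS_PER_RADIAN     # central angle converted back to km
```

Multiplying the central angle by the constant converts it back to a great-circle distance in km. Under a spherical model the same radius is applied to every great circle; the sphere itself is only an approximation of the ellipsoid.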

gbarlacchi commented 3 years ago

@FilippoSimini @jonpappalord what do you think? How can we address this problem?

vlingenfelter commented 3 years ago

An idea for addressing this: instead of using the fixed constant that assumes the equator, we could take an average latitude over the given set of points and use it to calculate the average kilometers per degree/radian of longitude for that particular set. This would still be somewhat inaccurate, but the accuracy would no longer vary with distance from the equator.

Potential method for reference: https://gis.stackexchange.com/questions/142326/calculating-longitude-length-in-miles

I'm happy to contribute to fixing this! I'm working on a project that spans the entirety of CA (spanning ~32 to ~42 degrees N) and I'm trying to avoid a bias in how clusters are calculated in Northern vs. Southern California.
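
The averaging idea in this comment could be sketched roughly as follows. This is only an illustration under a spherical-Earth assumption: the function names `km_per_degree_longitude` and `mean_latitude` and the sample points are hypothetical, and nothing here is part of scikit-mobility.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, spherical model (assumption)

def km_per_degree_longitude(lat_deg):
    """Length in km of one degree of longitude at the given latitude."""
    return math.radians(1.0) * EARTH_RADIUS_KM * math.cos(math.radians(lat_deg))

def mean_latitude(points):
    """Arithmetic mean latitude of (lat, lon) pairs given in degrees."""
    lats = [lat for lat, _ in points]
    return sum(lats) / len(lats)

# Illustrative (lat, lon) points only, roughly spanning the 32-42 degrees N range.
points = [(32.7, -117.2), (37.8, -122.4), (41.8, -124.2)]

avg_lat = mean_latitude(points)
km_per_deg_lon = km_per_degree_longitude(avg_lat)

# Ratio of a longitude degree's length at 32 N to its length at 42 N.
spread = km_per_degree_longitude(32.0) / km_per_degree_longitude(42.0)
```

The `spread` ratio gives a feel for how much a single averaged value would over- or under-state east-west distances at the two ends of a dataset spanning that latitude range.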

FilippoSimini commented 3 years ago

I think the ideal solution would be to allow users to specify their own distance function, which could come from another library, such as geopy.distance.distance.
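
As a rough illustration of the pluggable-distance idea, the sketch below passes the distance function in as a parameter. It uses only the standard library; `haversine_km`, `neighbors_within`, and the `distance_fn` parameter are hypothetical names, not an existing scikit-mobility API. A caller could presumably supply a wrapper around a geopy distance that returns kilometers instead of the default, but that wrapper is not shown here.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]  # (lat, lon) in degrees

def haversine_km(p: Point, q: Point, radius_km: float = 6371.0088) -> float:
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def neighbors_within(points: List[Point], center: Point, radius_km: float,
                     distance_fn: Callable[[Point, Point], float] = haversine_km) -> List[Point]:
    """Return the points whose distance to `center` is at most `radius_km`.

    `distance_fn` is user-supplied and must return kilometers.
    """
    return [p for p in points if distance_fn(center, p) <= radius_km]

# Illustrative points only.
pts = [(37.0, -120.0), (37.0005, -120.0), (37.01, -120.0)]
near_default = neighbors_within(pts, pts[0], 0.1)

# Any callable with the same signature can be swapped in.
near_custom = neighbors_within(pts, pts[0], 0.1, distance_fn=lambda a, b: 0.0)
```

Keeping the distance function as a plain callable means the clustering code does not need to know which Earth model the user prefers.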

vlingenfelter commented 3 years ago

Perhaps another solution would be to ask the user to use a projected coordinate system with units in meters (I believe a similar issue/concept is being discussed in #185), and to default to Haversine distance on lat, lon, like the rest of the library, when a geographic coordinate system is used?
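
To illustrate what working in meters might look like, here is a minimal sketch that maps (lat, lon) to local x, y offsets in meters with a simple equirectangular approximation around a reference point, then uses Euclidean distance. This is a stand-in for a real projected coordinate system and is only reasonable over small areas; the names `to_local_meters` and `euclidean_m` and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters, spherical model (assumption)

def to_local_meters(lat_deg, lon_deg, ref_lat_deg, ref_lon_deg):
    """Approximate (x, y) offsets in meters from a reference point.

    Equirectangular approximation: longitude is scaled by cos(reference latitude).
    """
    x = (math.radians(lon_deg - ref_lon_deg) * EARTH_RADIUS_M
         * math.cos(math.radians(ref_lat_deg)))
    y = math.radians(lat_deg - ref_lat_deg) * EARTH_RADIUS_M
    return x, y

def euclidean_m(a, b):
    """Straight-line distance in meters between two projected (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Illustrative reference point and a second point 0.001 degrees to the east.
ref = (37.0, -120.0)
a = to_local_meters(37.0, -120.0, *ref)
b = to_local_meters(37.0, -119.999, *ref)
d_m = euclidean_m(a, b)
```

With coordinates already in meters, a clustering radius could be given directly in meters and compared with a Euclidean metric, with no radian conversion involved.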

gbarlacchi commented 1 year ago

Update: due to an internal refactoring of the core modules, we are pausing development on this issue.