scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.77k stars, 496 forks

How to create clusters using haversine formula #29

Open Gudui opened 8 years ago

Gudui commented 8 years ago

I have some sample long/lat data I would like to cluster, but I cannot seem to get any meaningful results out of HDBSCAN. I'm surely doing something wrong.

hdb = HDBSCAN(min_cluster_size=3, metric='haversine').fit(sample_data)
hdb_labels = hdb.labels_
n_clusters_hdb_ = len(set(hdb_labels)) - (1 if -1 in hdb_labels else 0)

It gives me two clusters with my sample data of 50 long/lat points; the rest is noise. How exactly do I get the measured distance between the points?

Say I want the distance between points to be ~500 meters; how exactly can I extract that from the clustered data?

lmcinnes commented 8 years ago

It is quite possible you are not doing anything wrong. I've had some private reports of weirdness with Haversine. On reflection, I suspect that it doesn't play well with the Boruvka algorithm: Haversine presumes a compact manifold, and I'm not sure Boruvka is going to handle that in quite the way one would expect. To test that theory you could try:

hdb = HDBSCAN(min_cluster_size=3, metric='haversine', algorithm='prims_balltree').fit(sample_data)
hdb_labels = hdb.labels_
n_clusters_hdb_ = len(set(hdb_labels)) - (1 if -1 in hdb_labels else 0)

and see if that gives more sensible results. I suspect it may. If it does then I'll have to add some checks to ensure Haversine uses Prims. If it doesn't ... Haversine may just be broken. I'll have to look into it further.

thomasht86 commented 7 years ago

I know this is an old thread, but thought I would chime in anyway. The haversine metric requires coordinates in radians rather than degrees, so the epsilon also has to be calculated as meters / meters_per_radian. If we want an eps that corresponds to 2000 m, we need to input:

ms_per_radian = 6373000.0
eps = 2000 / ms_per_radian

For the lat/lon-data, the simplest is to just convert to radians by using:

X = np.radians(X)
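To make the unit conversion concrete, here is a minimal standard-library sketch of the haversine formula itself. It assumes the 6373000.0 meters-per-radian figure from the comment above; the function names and the two sample points are hypothetical and only for illustration. The haversine result is an angle in radians, so multiplying by the Earth radius gives meters, and dividing a distance in meters by the radius gives the equivalent angle.

```python
import math

# Metres per radian of arc (approximate Earth radius), value taken from this thread.
EARTH_RADIUS_M = 6373000.0


def haversine_radians(lat1, lon1, lat2, lon2):
    """Great-circle angle in radians between two points given in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


def meters_to_radians(meters):
    """Convert a ground distance in metres to an angle in radians."""
    return meters / EARTH_RADIUS_M


def radians_to_meters(angle):
    """Convert a haversine angle in radians back to metres."""
    return angle * EARTH_RADIUS_M


# Two hypothetical (lat, lon) points, one degree of latitude apart.
p = (math.radians(55.0), math.radians(12.0))
q = (math.radians(56.0), math.radians(12.0))

angle = haversine_radians(p[0], p[1], q[0], q[1])
distance_m = radians_to_meters(angle)   # roughly 111 km
eps_2000m = meters_to_radians(2000)     # a 2000 m threshold expressed in radians
```

The same scaling applies to any distance the haversine metric reports on radian input: multiply by the radius to read it in meters.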

lmcinnes commented 7 years ago

Thanks, that's actually very helpful!


iremozen-edremit commented 5 years ago


However, eps cannot be applied to the HDBSCAN algorithm; that usage is only for DBSCAN. Is that right?

lmcinnes commented 5 years ago

Yes, eps is not relevant in this case, but the conversion to radians remains important.
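Pulling the thread's advice together, the following is a rough sketch rather than a verified fix. It assumes the input is (longitude, latitude) pairs in degrees, as in the original question, and that the haversine metric expects rows ordered as [latitude, longitude] in radians, which is the convention scikit-learn documents for its haversine distance. The helper names `to_haversine_input` and `cluster_lon_lat` are hypothetical, and `algorithm='prims_balltree'` is simply the setting suggested earlier in this thread.

```python
import math


def to_haversine_input(lon_lat_degrees):
    """Convert (lon, lat) pairs in degrees to [lat, lon] rows in radians."""
    return [[math.radians(lat), math.radians(lon)] for lon, lat in lon_lat_degrees]


def cluster_lon_lat(lon_lat_degrees, min_cluster_size=3):
    """Cluster (lon, lat) points in degrees using HDBSCAN with the haversine metric."""
    # Imported lazily so the conversion helper works without the hdbscan package.
    from hdbscan import HDBSCAN

    X = to_haversine_input(lon_lat_degrees)
    hdb = HDBSCAN(
        min_cluster_size=min_cluster_size,
        metric="haversine",
        algorithm="prims_balltree",  # suggested earlier in this thread
    ).fit(X)
    labels = hdb.labels_
    # Label -1 marks noise, so it is not counted as a cluster.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    return labels, n_clusters
```

Calling `cluster_lon_lat(sample_data)` on the list of (lon, lat) pairs would then return the labels and the cluster count; no eps value is involved.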