scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.77k stars 497 forks source link

Can I do Spatio-temporal clustering with HDBSCAN ? #307

Open Stephane-Thales opened 5 years ago

Stephane-Thales commented 5 years ago

Hi All,

I've a dataset with latitude, longitude and date-time. Is there a way to do a spatio-temporal clustering that includes the 3 features? Havresine is accepting only 2 features (lat lon in radians). I can scale/normalize the 3 features and use Euclidian, but I believe that I will lose the impact of the physical distance between points.

Any recommendation?

Thanks

jc-healy commented 5 years ago

The hard bit here is to work out how distance in seconds relates to distance in radians. Those are fundamentally two totally different scales. Any density based clustering technique is going to require a consistent distance between your points. If you had few enough points you could compute the Haversine distance between points and the temporal distance between your points and fold those together via a scaling that you felt was an appropriate between radian and seconds. Then you could hand that to hdbscan as a pre-computed distance matrix. Just remember that your scaling will change your clustering results.

If you wanted to go deeper down the rabbit hole of combining data with different concepts of distance on different sets of it's features then I'd suggest starting off by reading The 2016 paper "Generalized Low Rank Models" by Udell, et all. Don't be intimidated by the size of the paper it's a very smooth read with lots of examples.

gewitterblitz commented 1 year ago

I have a time distance matrix to create a temporal clusterer. What's the best way to chain it with the Euclidean distance based clusterer to obtain spatio-temporal clustering?

gewitterblitz commented 1 year ago

Nevermind, I figured it out.

Stephane-Thales commented 1 year ago

Nevermind, I figured it out.

Hi, @gewitterblitz , do you mind sharing code template for the time-distance matrix and your spatio-temporal clustering solution ?

gewitterblitz commented 1 year ago

@Stephane-Thales: I tried the following method but I am not sure if it's the best way.

import hdbscan
from scipy.spatial.distance import pdist

# data is of the form ['time','longitude','latitude']

def compute_squared_EDM(X):
    return squareform(pdist(X, metric='euclidean'))

hdb = hdbscan.HDBSCAN(metric='precomputed',cluster_selection_epsilon=20,min_cluster_size=5)
n, m = data.shape
timeDisMat=compute_squared_EDM(data[:,0].reshape(n, 1)) 
yy = hdb.fit(timeDisMat)
yy.cluster_selection_epsilon = 1e-5
yy.min_cluster_size = 5
disMat = compute_squared_EDM(data[:,1:3])
zz = yy.fit(disMat)

FWIW: This did not help me much because my data had some points too close to each other in space but distant in time and the hdbscan output still clustered the time-distant points into one cluster. Let me know if you have found something better.

Stephane-Thales commented 1 year ago

Hi @gewitterblitz ,

Is it possible that your 2nd .fit() is overwriting the 1st and all the part with timeDisMat is lost? DBScan does not have a partial_fit() feature only MiniBatchKMeans have one: https://scikit-learn.org/0.15/modules/scaling_strategies.html#incremental-learning

Above, jc-healy mentioned:

compute the Haversine distance between points and the temporal distance between your points and fold those together via a scaling that you felt was an appropriate between radian and seconds.

So: a) finding an appropriate rescaling for the 2 matrixes timeDisMat and disMat b) fusing the matrix. In R there is a fuse function, can't find in python, so you can do sqrt(timeDisMat^2 + disMat^2) as squared Euclidean distances are additive according to here c) running the dbscan once

I've no time to try... but if you have, please share your result and code ?

gewitterblitz commented 1 year ago

Yes, you are right. The first just gets overwritten instead of acting on top of the first cluster classification.

I tried your suggestion but still don't get convincing results. It's most likely due to the uneven scaling of the time and distance dimensions. Not sure if there's an easy way to figure out the correct rescaling method.

I tried 'normalization' ((x-x.min)/(x.max-x.min)) and the StandardScaler function from sklearn.preprocessing but both of these didn't help much.

Stephane-Thales commented 1 year ago

Perhaps the data need to be normalized also if the shapes are too different ?

using StandardScaler ?