DBSCAN callable metric shouldn't force float matrix data

scikit-learn / scikit-learn

DBSCAN callable metric shouldn't force float matrix data #1500

Closed nournia closed 11 years ago

nournia commented 11 years ago

I want to cluster records that are not in float-matrix form and that have no feature vector per record. Fortunately, the DBSCAN clustering algorithm accepts a callable similarity function, but:

from sklearn.cluster import DBSCAN

model = DBSCAN(metric=mysimilarityfunc)
model.fit(data)

tries to convert the whole matrix to float and raises this error:

ValueError: could not convert string to float

Is there any solution for this problem?

amueller commented 11 years ago

Yes, I think there is. Use metric='precomputed' and pass the distance matrix instead of the data. That should do the trick.
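
A minimal sketch of the metric='precomputed' route, for illustration only. The records, the `jaccard_similarity` function, and the `eps` and `min_samples` values are hypothetical stand-ins, and turning a similarity into a distance as 1 - similarity assumes the similarity lies in [0, 1]. Note that this builds the full n x n matrix, so memory and time are both quadratic in the number of records.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def jaccard_similarity(a, b):
    """Hypothetical pairwise similarity in [0, 1] between two strings."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def distance_matrix(records, similarity):
    """Build a symmetric n x n distance matrix as 1 - similarity."""
    n = len(records)
    dist = np.zeros((n, n), dtype=float)
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - similarity(records[i], records[j])
            dist[i, j] = dist[j, i] = d
    return dist


# Non-numeric records: no feature vectors, only a pairwise function.
records = [
    "red apple pie",
    "red apple tart",
    "green apple pie",
    "fast blue car",
    "fast blue truck",
    "slow blue car",
]

dist = distance_matrix(records, jaccard_similarity)

# With metric="precomputed", fit() receives distances instead of features.
labels = DBSCAN(eps=0.6, min_samples=2, metric="precomputed").fit(dist).labels_
print(labels)
```

Because DBSCAN only ever sees the float distance matrix here, the string-to-float conversion error from the original report does not arise.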

nournia commented 11 years ago

Thanks, but I can't do that. My data is large and I can't afford the O(n^2) cost of a distance matrix. Actually, I don't know whether DBSCAN is the right algorithm. It was my only choice because I couldn't find any other clustering algorithm in scikit-learn that works without full feature vectors or a distance matrix. The entries aren't representable in a feature space, so I wrote a special similarity function for pairs.

amueller commented 11 years ago

OK, so that is a whole different story then. Our implementation of DBSCAN definitely computes the whole dissimilarity matrix. How many samples do you have? The only clustering algorithm in scikit-learn that supports out-of-core computation is minibatch k-means, and that doesn't work with precomputed dissimilarities. Off the top of my head, I don't really know any algorithms that would work well in your setting. Maybe try some core-set approach? Most clustering algorithms are at least quadratic in time complexity, so you would have to wait quite a long time anyhow. Would a long run time be OK for you?

Btw, closing this, as the title of the issue is not really your problem. Maybe go to metaoptimize and ask about out-of-core algorithms for arbitrary distance measures.
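
To illustrate the memory versus time trade-off discussed above, here is a rough pure-Python sketch of DBSCAN that calls a pairwise distance function on demand instead of storing a matrix. It is not scikit-learn's implementation; `dbscan_on_demand`, `region_query`, `token_distance`, and the sample records are all hypothetical names and data chosen for illustration. Memory stays linear in the number of records, but the number of distance calls is still roughly n^2, so a long run time remains.

```python
def token_distance(a, b):
    """Hypothetical distance: 1 - Jaccard similarity of whitespace tokens."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return 1.0 - (len(sa & sb) / len(union) if union else 1.0)


def region_query(records, i, distance, eps):
    """Indices of all records within eps of records[i], including i itself."""
    return [j for j in range(len(records))
            if j == i or distance(records[i], records[j]) <= eps]


def dbscan_on_demand(records, distance, eps, min_samples):
    """Return one label per record: cluster ids 0, 1, ... or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(records)
    cluster_id = -1
    for i in range(len(records)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(records, i, distance, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        queued = set(neighbors)  # keeps the queue at most n entries long
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(records, j, distance, eps)
            if len(j_neighbors) >= min_samples:  # j is core: keep expanding
                for k in j_neighbors:
                    if k not in queued:
                        queued.add(k)
                        queue.append(k)
    return labels


records = [
    "red apple pie",
    "red apple tart",
    "green apple pie",
    "fast blue car",
    "fast blue truck",
    "slow blue car",
    "lonely purple submarine",
]

labels = dbscan_on_demand(records, token_distance, eps=0.6, min_samples=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Each record triggers one neighborhood scan over all other records, which is where the quadratic time comes from; avoiding that would need an index structure or an approximation suited to the particular distance.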

nournia commented 11 years ago

Thanks.