scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Model persistence #38

Open umut-toprak opened 8 years ago

umut-toprak commented 8 years ago

Hello,

I would like to save a trained model using pickle and then use a predict method on new data points to predict the cluster membership probabilities for each new data point. So far I have only found the fit_predict method, which is supposed to alter the model.
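
Roughly, the workflow I have in mind looks like this (the predict call at the end is hypothetical, since no such method exists yet; train_data and new_points are placeholders):

```python
import pickle

import hdbscan

# Fit the clusterer on the training data and persist it with pickle.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(train_data)
with open("hdbscan_model.pkl", "wb") as f:
    pickle.dump(clusterer, f)

# Later: reload the model and predict memberships for new points.
with open("hdbscan_model.pkl", "rb") as f:
    clusterer = pickle.load(f)
# labels, probabilities = clusterer.predict(new_points)  # hypothetical method
```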

Do you think this feature (which might be useful for many I guess) is easy to implement?

Thanks a lot!

lmcinnes commented 8 years ago

I have actually put some thought into this, and there are a few options. One of the problems is exactly how to do this in a way that fits well with the sklearn API. One approach is to retain the minimum spanning tree, which can then be updated with the new points to produce a new minimum spanning tree, after which the tree condensation and cluster extraction are re-run. This is "ideal" in the sense that it gives a result identical to what you would get if you had clustered the original points plus all the new points at once, without having to re-cluster everything from scratch (particularly good for large initial models). The catch is that the result will potentially change how the original points cluster (new points may cause clusters to merge, or otherwise change) and amounts to a complete relabelling of all points. This means the labels of new points will not necessarily match the original model's labels, which is probably not what you want. One could return the full relabelling, but that doesn't match the sklearn API for predict.

One could return labels that match the original labelling, but then you would essentially be finding the "nearest cluster" and not actually be performing HDBSCAN at all. I don't think that really offers much.

I'm interested to know what people feel the "right" solution might be in this situation.

umut-toprak commented 8 years ago

Thank you for your response. After your explanation, I agree that this is not a trivial change.

From my perspective and for my use case, the predict method would be used on a single new data point, and an alteration of the underlying model, as in "The catch is that the result will potentially change how the original points cluster (new points may cause clusters to merge, or otherwise change) and amounts to a complete relabelling of all points", would not be appropriate. However, this is again only my perspective...

"One could return labels that match the original labelling, but then you would essentially be finding the "nearest cluster" and not actually be performing HDBSCAN at all. I don't think that really offers much." Do you think this could be expressed in probabilistic terms (as in fuzzy clustering) rather than a distance? If yes, that would be extremely useful for me, again with the caveat that I cannot speak for others.

Thanks a lot again for your interest.

PS: I closed the issue by mistake, sorry

lmcinnes commented 8 years ago

Probabilistic terms are what you'll get automatically as HDBSCAN (at least in this implementation) supports soft clustering. I think, in practice, all you really want to do is a k-nearest-neighbor query, get the labels of the k-neighbours and do a distance weighted vote on what the label of the new point should be. This doesn't even require you to save the model, just the labels, and then use the kNN tools from sklearn. I think that's pretty straightforward to implement as a function. I can do that for you if you need, but I imagine you may be able to work through that yourself.

The catch with this, of course, is that it can give you bad results for various pathological cases, and gets progressively less accurate in higher dimensions (as the curse of dimensionality pushes points to the corners and boundaries). You should be a little wary.

I do think this is a feature that other people do want, and I am keen to implement it, but I really want to get it right rather than just providing "something".
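
For what it's worth, a minimal sketch of that kNN approach using sklearn might look like this (train_data, new_points, and a fitted clusterer are assumed to be available; noise points labelled -1 are dropped before fitting):

```python
from sklearn.neighbors import KNeighborsClassifier

# Drop noise points (label -1) and fit a kNN classifier on the
# cluster labels produced by HDBSCAN.
mask = clusterer.labels_ != -1
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(train_data[mask], clusterer.labels_[mask])

# Distance-weighted vote among the nearest clustered neighbours.
new_labels = knn.predict(new_points)
membership_scores = knn.predict_proba(new_points)  # soft scores per cluster
```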

jc-healy commented 8 years ago

I feel compelled to mention that when one says probability, one needs to think about the question of "the probability of what happening?". The .probability_ property that we have is definitely a score between zero and one. The score corresponds to the proportion of a cluster's lifespan during which the point remains within the cluster.

This is a fantastic soft score for determining the strength of cluster membership, but it shouldn't be treated as a probability.


lmcinnes commented 7 years ago

I have now merged the experimental implementation into master. The clusterer now takes a prediction_data keyword (set it to True to generate the data required for prediction), and after that you can use the approximate_predict function to predict the clusters of new points. See the docstrings of the function for usage in the meantime while I get some proper documentation and tutorials written.
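
Roughly, usage looks something like this (train_data and new_points are placeholders for your own arrays):

```python
import hdbscan

# Fit with prediction_data=True so the structures needed for
# prediction are generated and cached on the model.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(train_data)

# Predict cluster labels and membership strengths for new points.
test_labels, strengths = hdbscan.approximate_predict(clusterer, new_points)
```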

kavin26 commented 6 years ago

Hi @lmcinnes, when the cluster label for a new data point is predicted, is it compared to all core points in the existing clusters, or to specific reference points (like centroids in k-means)?

lmcinnes commented 6 years ago

It is compared to the structure of the existing clustering, so yes, all the core points of the existing clusters.

saif-freestar commented 6 years ago

Maybe a little late to this thread, but if we gave the user an option to extract the determined cluster centroids, then the user could simply take those centroids and run a k-NN whenever a new point shows up to classify it. Thoughts?
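
For illustration, a rough sketch of that idea using sklearn's NearestCentroid (rather than a full k-NN over all points), assuming train_data, new_points, and a fitted clusterer are available:

```python
from sklearn.neighbors import NearestCentroid

# Compute per-cluster centroids from the HDBSCAN labels (noise, -1, dropped)
# and classify new points by their nearest centroid.
mask = clusterer.labels_ != -1
nc = NearestCentroid()
nc.fit(train_data[mask], clusterer.labels_[mask])

new_labels = nc.predict(new_points)
```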