rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0
[FEA] Need an approximate_predict function for cuML HDBSCAN #4448

Closed. sudhanshu-shukla-git closed this issue 2 years ago

sudhanshu-shukla-git commented 2 years ago

Is your feature request related to a problem? Please describe.

I would like to use cuML HDBSCAN to predict cluster labels for new points from an already fitted model, similar to approximate_predict in the scikit-learn-contrib hdbscan library.

Describe the solution you'd like

A function similar to the hdbscan library's approximate_predict: https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict

Predict the cluster label of new points. The returned labels will be those of the original clustering found by clusterer, and therefore are not (necessarily) the cluster labels that would be found by clustering the original data combined with points_to_predict, hence the ‘approximate’ label.
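To make the idea concrete, here is a minimal pure-Python sketch of out-of-sample label assignment. It is a simplification, not the hdbscan or cuML implementation: each new point takes the label of the training point that is nearest under mutual reachability distance, and the original clustering is never recomputed. The function names `core_distances` and `approximate_predict_sketch` and the parameter `k` are hypothetical, and the sketch skips the condensed-tree check that the reference implementation uses to decide whether a new point is noise.

```python
import math


def core_distances(points, k):
    """Distance from each point to its k-th nearest other point."""
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[k - 1])
    return cores


def approximate_predict_sketch(train, labels, new_points, k=2):
    """Assign each new point the label of its nearest training point
    under mutual reachability distance (simplified; no noise handling)."""
    train_core = core_distances(train, k)
    predicted = []
    for x in new_points:
        dists = [math.dist(x, p) for p in train]
        # Core distance of the new point: distance to its k-th nearest training point.
        x_core = sorted(dists)[k - 1]
        # Mutual reachability: max of both core distances and the raw distance.
        mreach = [max(d, x_core, c) for d, c in zip(dists, train_core)]
        nearest = min(range(len(train)), key=mreach.__getitem__)
        predicted.append(labels[nearest])
    return predicted


# Toy data: two well-separated groups with labels from an earlier clustering.
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
         (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = [0, 0, 0, 1, 1, 1]
print(approximate_predict_sketch(train, labels, [(0.5, 0.5), (10.5, 10.5)]))
```

Because the labels come from the existing model, adding new points never changes the labels of the training points, which is why the result is only approximate.
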

Describe alternatives you've considered

A CPU-based solution already exists in the scikit-learn-contrib hdbscan library, but I need a GPU-based one.

https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

sudhanshu-shukla-git commented 2 years ago

@cjnolet Do we have any updates on this feature? When can we expect this to be released?

whymauri commented 2 years ago

(commenting to maintain the issue as active)

sudhanshu-shukla-git commented 2 years ago

(commenting to maintain the issue as active)

cedivad commented 2 years ago

I was also looking for this feature. I assume the models aren't binary-compatible, so we can't pass a model created by cuML to, say, scikit-learn's approximate_predict?

whymauri commented 2 years ago

Technically, you can extract the required data structures from RAPIDS and inject them into SKLearn's HierarchicalLabelTree.

But you will have to do a lot of implementation on your own end.

osalem-l commented 2 years ago

> Technically, you can extract the required data structures from RAPIDS and inject them into SKLearn's HierarchicalLabelTree.
>
> But you will have to do a lot of implementation on your own end.

Could you illustrate how? I'm currently trying to figure this out.

cedivad commented 2 years ago

I've looked at SKLearn's implementation, and it seems to use a brute-force approach, calculating the distance to each centroid one by one. On a GPU you could parallelise the distance calculations, but you would still need to compare the results one by one. At best you would spawn a "binary tree" of comparison threads. I believe this task isn't very parallelizable, and maybe that's why it was de-prioritized?

If so, we only need to extract the centroids from RAPIDS and use them in whatever code we want, say a small Go HTTP server that runs inference on new vectors.
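The approach described in this comment can be sketched in a few lines of pure Python. This is only an illustration of the comment's idea, under the assumption that clusters can be summarised by centroids (HDBSCAN clusters are not centroid-based in general); it is not code from scikit-learn, hdbscan, or cuML. The function name `nearest_centroid` is hypothetical. The distances are independent of each other, and the minimum is found with a pairwise reduction that takes about log2(n) rounds for n centroids.

```python
import math


def nearest_centroid(x, centroids):
    """Return the index of the centroid closest to x."""
    if not centroids:
        raise ValueError("at least one centroid is required")
    # Step 1: every distance is independent, so this part parallelises well.
    candidates = [(math.dist(x, c), i) for i, c in enumerate(centroids)]
    # Step 2: pairwise ("binary tree") reduction to find the minimum.
    while len(candidates) > 1:
        reduced = [
            min(candidates[j], candidates[j + 1])
            for j in range(0, len(candidates) - 1, 2)
        ]
        if len(candidates) % 2:
            # An unpaired candidate moves to the next round unchanged.
            reduced.append(candidates[-1])
        candidates = reduced
    return candidates[0][1]


centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0), (0.0, 10.0)]
print(nearest_centroid((4.0, 4.0), centroids))
```

Each reduction round halves the number of candidates, so the sequential part is short compared with the distance computations.
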

RaiAmanRai commented 2 years ago

Hi @cjnolet @divyegala, are there any updates on this feature, or an approximate timeline for when it will roll out?

Would really appreciate the work.

ldsands commented 2 years ago

> Hi @cjnolet @divyegala, are there any updates on this feature, or an approximate timeline for when it will roll out?
>
> Would really appreciate the work.

Does this pull request (https://github.com/rapidsai/cuml/pull/4800) not add this feature? I haven't dug in deeply, but at a glance it looks like it does. At the very least, this pull request is needed before the approximate_predict feature can be implemented.

cjnolet commented 2 years ago

@sudhanshu-shukla-git @RaiAmanRai

> Does https://github.com/rapidsai/cuml/pull/4800 not add this feature? I haven't dived in deep to see but just glancing it looks like it does. At the very least, this pull request is needed before the approximate_predict feature can be implemented.

That pull request implements the pieces needed for fuzzy clustering, which is a stepping stone towards out-of-sample prediction (approximate_predict). We're working towards approximate_predict.
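As a rough illustration of why fuzzy clustering is a stepping stone, the sketch below turns a soft membership vector (one probability per cluster) into a hard label. It is an assumption-laden toy, not how cuML or hdbscan derive labels: the function name `label_from_membership` and the `noise_threshold` parameter are hypothetical, and -1 follows the usual HDBSCAN convention for noise.

```python
def label_from_membership(membership, noise_threshold=0.5):
    """Pick the most likely cluster, or -1 (noise) if no cluster is convincing."""
    if not membership:
        raise ValueError("membership vector must not be empty")
    # Index of the cluster with the highest membership probability.
    best = max(range(len(membership)), key=membership.__getitem__)
    # Hypothetical rule: weak memberships are reported as noise.
    return best if membership[best] >= noise_threshold else -1


print(label_from_membership([0.1, 0.7, 0.2]))
print(label_from_membership([0.3, 0.3, 0.4]))
```
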

DeepTitan commented 2 years ago

I second the need for this feature; it would really help in my project.