Closed sudhanshu-shukla-git closed 2 years ago
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
@cjnolet Do we have any updates on this feature? When can we expect this to be released?
(commenting to maintain the issue as active)
I was also looking for this feature. I assume the models aren't binary-compatible, so we can't use a model created by cuML with, say, scikit-learn's approximate_predict?
Technically you can extract the required data structures from RAPIDS and inject them into SKLearn's HierarchicalLabelTree, but you will have to do a lot of the implementation on your own end.
Could you illustrate how? I'm currently trying to figure this out.
I've looked at SKLearn's implementation, and it seems to use a brute-force approach, calculating the distance to each centroid one by one. On a GPU, I think you could parallelize the distance calculations, but you would still need to compare the results one by one; at best you could spawn a "binary tree" of comparison threads. I believe this task isn't very parallelizable, and maybe that's why it was de-prioritized.
If so, we only need to extract the centroids from RAPIDS and use them in whatever code we want, say a small Go HTTP server for inference on new vectors.
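To make the idea above concrete, here is a minimal NumPy sketch of the centroid-based shortcut being proposed. Note this is only an illustration of nearest-centroid assignment on extracted centroids; it is *not* equivalent to HDBSCAN's approximate_predict, which walks the condensed cluster tree, and the centroid array here is a stand-in for whatever you extract from the fitted model.

```python
import numpy as np

def nearest_centroid_predict(centroids, points):
    """Assign each point the label (index) of its nearest centroid.

    Brute-force: computes the full (n_points, n_centroids) distance
    matrix, mirroring the one-by-one distance checks described above.
    """
    # Pairwise differences: shape (n_points, n_centroids, n_dims)
    diffs = points[:, None, :] - centroids[None, :, :]
    # Squared Euclidean distances: shape (n_points, n_centroids)
    dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    # Index of the closest centroid for each point
    return dists.argmin(axis=1)

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
points = np.array([[0.5, -0.2], [9.0, 11.0]])
print(nearest_centroid_predict(centroids, points))  # [0 1]
```

The distance matrix itself is embarrassingly parallel (and trivially portable to CuPy by swapping the import), which matches the observation that only the final argmin reduction is inherently sequential per point.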
Hi @cjnolet @divyegala, any updates on this feature, or an approximate timeline for when it will roll out?
Would really appreciate the work.
@sudhanshu-shukla-git @RaiAmanRai
Does https://github.com/rapidsai/cuml/pull/4800 not add this feature? I haven't dug in deeply, but at a glance it looks like it does. At the very least, that pull request is needed before the approximate_predict feature can be implemented.
That pull request implements the pieces needed for fuzzy clustering, which is a stepping stone towards out-of-sample prediction (approximate_predict). We're working towards approximate_predict.
I second the need for this feature, would really help in my project
Is your feature request related to a problem? Please describe. I wish I could use cuML HDBSCAN to predict cluster labels for new points from an existing fitted model, similar to scikit-learn-style HDBSCAN's approximate_predict.
Describe the solution you'd like Similar to scikit-learn HDBSCAN's approximate_predict https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict
Predict the cluster label of new points. The returned labels will be those of the original clustering found by clusterer, and therefore are not (necessarily) the cluster labels that would be found by clustering the original data combined with points_to_predict, hence the ‘approximate’ label.
Describe alternatives you've considered
A CPU-based solution is already available in the hdbscan library, but a GPU-based solution is needed.
https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict