rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] OPTICS clustering #5982

Open astomer2 opened 3 months ago

astomer2 commented 3 months ago

Is your feature request related to a problem? Please describe.

I am working with a dataset of 20,000 sequences and need to perform distance-based clustering on them. However, HDBSCAN labels an excessive number of points as noise, which is undesirable for my analysis. I would like to use cuML to perform OPTICS clustering, which may be better suited to my use case and could give better clustering results.

Describe the solution you'd like

I would like to request an implementation of the OPTICS clustering algorithm in cuML. OPTICS is known for its ability to handle varying densities in a dataset, and I expect it to produce more meaningful clusters, with fewer points labeled as noise, than HDBSCAN does on my data. Having OPTICS in cuML would allow efficient clustering of large datasets on GPUs, which is essential for my work with 20,000 sequences.

Describe alternatives you've considered

So far I have tried HDBSCAN, but it labels too many points as noise, which makes it less effective for my needs. I have also considered CPU-based implementations of OPTICS, but they are too slow to be practical at my dataset size. Another alternative is KMeans or other clustering algorithms, but they do not handle varying densities as effectively as OPTICS.
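
For reference, the CPU-based alternative mentioned above might look roughly like the sketch below, which uses scikit-learn's `OPTICS` with a precomputed distance matrix and reports the share of points labeled as noise. The synthetic 2-D blobs, the Euclidean distances, `min_samples=10`, and the `noise_fraction` helper are all illustrative assumptions; the real input would be a pairwise sequence-distance matrix.

```python
import numpy as np
from sklearn.cluster import OPTICS


def noise_fraction(labels):
    """Share of points labeled -1 (noise) by the clusterer."""
    labels = np.asarray(labels)
    return float(np.mean(labels == -1))


# Synthetic stand-in for the real data: three 2-D blobs with different spreads.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.2, size=(60, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.6, size=(60, 2)),
    rng.normal(loc=(-5.0, 5.0), scale=1.0, size=(60, 2)),
])

# Dense pairwise Euclidean distance matrix (assumption: a sequence
# distance such as edit distance would be substituted here).
diff = points[:, None, :] - points[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

# metric="precomputed" tells OPTICS to treat the input as distances.
model = OPTICS(min_samples=10, metric="precomputed")
model.fit(distances)
labels = model.labels_

print("noise fraction:", noise_fraction(labels))
```

Note that a dense float64 distance matrix for 20,000 points takes about 3.2 GB (20,000 × 20,000 × 8 bytes), so memory as well as runtime is a consideration at this scale.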

Thank you for considering this feature request.

dantegd commented 3 months ago

Thanks for the issue @astomer2, this is an interesting suggestion. Tagging @cjnolet who might have some additional thoughts here.