rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0
4.26k stars 535 forks source link

[FEA] Incremental version of dbscan and hdbscan #6015

Open PrinceP opened 3 months ago

PrinceP commented 3 months ago

Is your feature request related to a problem? Please describe. I wish I could use cuML to do incremental version of dbscan and hdbscan

Describe the solution you'd like For a set of streaming data, cuML dbscan/hdbscan can create clusters and return the labels. Then again the fit is called with older cache to make sure the labels preassigned are used again.

Describe alternatives you've considered Running the dbscan multiple number of times. We need to reassign the cluster ids back to older data

Additional context https://www.dbs.ifi.lmu.de/Publikationen/Papers/VLDB-98-IncDBSCAN.pdf

dantegd commented 3 months ago

Thanks for the issue @PrinceP, in general incremental and online algorithms are an area of great interest to us. Thanks for linking the paper, DBSCAN indeed will be very useful and we will work on it (of course, external contributions always accepted!) in the future, but can't give an estimated timeline currently.

HDBSCAN of course will be just if not more interesting and important, but on a first quick read it wasn't obvious to me if the results/approach of the paper would directly apply, so it might require some additional thought into it.

DataOmbudsman commented 3 months ago

There's an implementation of the Incremental DBSCAN paper in the incdbsan repository. For disclosure, I am the author of that project. In my experience it is rather tricky to implement it correctly so I thought to link to the project in case it helps.

cjnolet commented 3 months ago

hi @DataOmbudsman, thank you for sharing your work with us! I'm going to take a look at your work, however in the meantime is there a specific operation or step in the (incremental)dbscan process that you see being parallelizable on the GPU? Does the incremental bit still involve a bunch of fairly indepenent operations that could be parallelized or made concurrent?

DataOmbudsman commented 2 months ago

@cjnolet In DBSCAN there's a distance matrix calculation step that could be parallelized. It's also there in the incremental version (during insertion). (Note that the above project currently implements insertion in an online manner rather than as batch insertion, making less space for parallelization.) There's also a connected component search step (during insertion) and a graph traversal step (during deletion) in incremental DBSCAN. I'm not sure though if they can be parallelized on GPU.