Open PrinceP opened 3 months ago
Thanks for the issue @PrinceP, in general incremental and online algorithms are an area of great interest to us. Thanks for linking the paper, DBSCAN indeed will be very useful and we will work on it (of course, external contributions always accepted!) in the future, but can't give an estimated timeline currently.
HDBSCAN of course will be just if not more interesting and important, but on a first quick read it wasn't obvious to me if the results/approach of the paper would directly apply, so it might require some additional thought into it.
There's an implementation of the Incremental DBSCAN paper in the incdbsan repository. For disclosure, I am the author of that project. In my experience it is rather tricky to implement it correctly so I thought to link to the project in case it helps.
hi @DataOmbudsman, thank you for sharing your work with us! I'm going to take a look at your work, however in the meantime is there a specific operation or step in the (incremental)dbscan process that you see being parallelizable on the GPU? Does the incremental bit still involve a bunch of fairly indepenent operations that could be parallelized or made concurrent?
@cjnolet In DBSCAN there's a distance matrix calculation step that could be parallelized. It's also there in the incremental version (during insertion). (Note that the above project currently implements insertion in an online manner rather than as batch insertion, making less space for parallelization.) There's also a connected component search step (during insertion) and a graph traversal step (during deletion) in incremental DBSCAN. I'm not sure though if they can be parallelized on GPU.
Is your feature request related to a problem? Please describe. I wish I could use cuML to do incremental version of dbscan and hdbscan
Describe the solution you'd like For a set of streaming data, cuML dbscan/hdbscan can create clusters and return the labels. Then again the fit is called with older cache to make sure the labels preassigned are used again.
Describe alternatives you've considered Running the dbscan multiple number of times. We need to reassign the cluster ids back to older data
Additional context https://www.dbs.ifi.lmu.de/Publikationen/Papers/VLDB-98-IncDBSCAN.pdf