h2oai / h2o4gpu

H2Oai GPU Edition
Apache License 2.0
460 stars 94 forks source link

Implement GPU based DBSCAN clustering algo #239

Open teju85 opened 6 years ago

teju85 commented 6 years ago

Original serial algo: https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf GPU based implementations:

  1. CUDA-DClust: http://www.dbs.ifi.lmu.de/Publikationen/Boehm/CIKM_09.pdf. It uses a collision matrix approach to name clusters. also has an index data structure on GPU that helps reduce computational complexity of eps-neighborhood detection.
  2. G-DBSCAN: https://pdfs.semanticscholar.org/31df/abb8d1085ac468b60a83d32af2a558407c95.pdf. Simpler implementation. Generates a proximity graph out of the eps-neighborhood info and then performs BFS traversal to name the clusters. Thus exposing more parallelism than the previous approach. IMHO, I think we should start with an implementation based on G-DBSCAN and the refine the implementation as per performance profile.
en-casa commented 6 years ago

I'm new to this repository but I'm interested in helping to implement G-DBSCAN as a part of a class project at NCSU (I'm an MS student.) I noticed these two repos: DBSCAN and CudaDBClustering, that may be good starting points. The latter is a little messy.

Has this issue made progress? Is there anything I can help with?

thanks, nico

mdymczyk commented 6 years ago

From what I know @teju85 will not be implementing this for the time being (think they decided to work on Kalman Filters) so you'd have to do the whole implementation yourself (which of course you're more than welcome to do). This would mean:

1) First a CUDA backend - you can either code it yourself based on the papers @teju85 posted or use an existing repo but in such a case you have to make sure the licensing is appropriate. I don't see any license in the repo you posted so you'd have to reach out to the original committer. We require Apache 2.0 compatible licensing.

You can follow the KMeans/tSVD file structure. All our GPU code is here https://github.com/h2oai/h2o4gpu/tree/master/src/gpu

2) A Python wrapper - you can have a look at existing examples for KMeans (kmeans.py) or truncated SVD. Examples here https://github.com/h2oai/h2o4gpu/tree/master/src/interface_py/h2o4gpu/solvers

3) Benchmarks against existing implementations (i.e. sklearn).

4) If you have time (and will) you can work on C++ implementation.

Even only a CUDA backend (without Python wrappers and C++ impl) would be of great help!

For starters, it would be good if you just cloned and built the repo, maybe ran tests, check if everything is working. Then go through one existing algo (I think truncated SVD would be easiest) and if you think you're up to it fork the repo and start working on your impl.

If you have any questions ping us here or on gitter.

mdymczyk commented 6 years ago

@n-casale my bad, misunderstood @teju85 - they already started working on DBSCAN implementation. Not sure if they need any help with it but probably won't be very parallelizable.

teju85 commented 6 years ago

Thanks Mateusz for the clarification.

@n-casale Thanks for showing interest and also for the links! I did look at DBSCAN which implements the G-DBSCAN algo and I believe we can improve the distance-matrix evaluation kernel (which is the biggest contributor towards the runtime). However, it will still be a few weeks before we get to the part of implementing the connected-component-labelling stage (BFS traversal part, I mean). Hence, if you are still interested, that can be an aspect you can contribute to.

Regards, Thejaswi

tomaszdudek7 commented 6 years ago

Hello @teju85 is there already a PR regarding this issue we could follow?

mdymczyk commented 6 years ago

Czesc @spaszek :-)

From what I remember @teju85 got a bit sidetracked with other work related issues so I wouldn't count on getting DBSCAN in the nearest future but let me follow up with him and see what we can do :-) Are you using DBSCAN in production? What would be your use case - how many GPUs are you guys using? What's your data size? We're trying to figure out what the users might need.

tomaszdudek7 commented 6 years ago

Oh, okay - I was under the impression you guys would have DBScan ready at Q1 2018 (as said in the post here https://blog.h2o.ai/2017/12/h2o4gpu-hands-lab-video-updates/ ) :P

To be honest, we are not using DBScan anywhere(yet) - I am just looking around for alternatives to scikit-learn in case its implementation would be not sufficient for us. It is kinda hard to measure our data size and GPU requirements right now though.

Do you plan to port the implementation to h2o-core? I would love to call DBScan from Scala easily, even without the GPUs. I work at a heavily JVM based company and we replace Python with Scala/Clojure whenever possible.

ąęśćóźłż - if you needed them in Japan :D

teju85 commented 6 years ago

Hi @spaszek , Sorry, there are currently no PR's for this ready from my side, yet. Like @mdymczyk mentioned, I got distracted with other items and couldn't get to this after the implementation of neighborhood calculation. :(

I'm hopeful of getting back to this soon, though. Regards

mdymczyk commented 6 years ago

@spaszek our roadmap is more of a blueprint than anything set in stone unfortunately as we're heavily understaffed :(

I'm actually working on Java bindings (using SWIG) as we speak and hope to have it integrated with h2o-3/flow before GTC next month (so you'll be able to use it through h2o-3 or just as standalone h2o4gpu4java lib:-). This will have certain limitations, though, as our GPU implementations are not node-distributed (something we're looking into but still not sure should we go with distribution on the CPU or GPU level and whether it is actually necessary).

tomaszdudek7 commented 6 years ago

@mdymczyk Would there be a way for me to contribute to h2o4gpu4java? :)

I'm already working on PR at sparkling water project. Should I just search for another open issue in your Jira?

mdymczyk commented 6 years ago

@spaszek getting Java bindings should be easy and I will probably have it working by the end of the week/next week. But if you want to use it through H2O-3 or Sparkling Water we'll have to somehow integrate that and there are 2 ways to do that:

1) write Java model classes for each of the h2o4gpu algorithms, there's some example classes for that here https://github.com/h2oai/h2o-3/tree/master/h2o-algos/src/main/java/hex/example. One "trick" will has to be used, our models have map and reduce functions, as originally H2O was based on M/R. Here, we wouldn't be using them but instead run all the modeling in a special setupLocal method.

2) go through the existing model classes for H2O-3 KMeans, GLM, PCA etc. and somehow add this as one of the backends.

Will have to have a chat with the h2o-3/sw teams, see which path to choose and then we'll make some JIRAs or github issues here. Will keep you posted :-)

tomaszdudek7 commented 6 years ago

great, thanks @mdymczyk

I would love to contribute to that if possible - please don't forget to mention me when you guys settle for a solution. I just started using h2o (and sparkling water) extensively. Contributing to the project would be a good way of learning how to actually use the library (as well as speeding you guys up a little in the process).

voycey commented 5 years ago

Hey @teju85 - are there any updates on this? We have tried several other parallelism methods (Spark etc) but I think that CUDA is the way to go - I also think that some of the other items we are currently doing would be a really good fit for the RTX technology (Spatial joins etc)

teju85 commented 5 years ago

Hi @voycey , I forgot to update this thread. We now have DBSCAN as part of cuML repo here: https://github.com/rapidsai/cuml.

voycey commented 5 years ago

This seems to have problems crashing unfortunately! Ill keep an eye on the issue that is debugging it!

teju85 commented 5 years ago

@voycey Fixes for crash issue should now be at the HEAD of rapidsai/cuml. Can you check and provide feedback, please? If you still find problems, file an issue in that repo and I'll look at it sooner.

teju85 commented 5 years ago

@mdymczyk Do you think we still need to keep this issue open?

voycey commented 5 years ago

I will have an attempt at it this week sometime if I can!

blackvvine commented 3 years ago

Sorry for necro-bumping, but has there been any progress or alternatives to this?

teju85 commented 3 years ago

@blackvvine please refer to my earlier message above. We implemented dbscan inside cuML. Not sure if we ever want to provide a wrapper for h2o4gpu, though. I'll leave this decision to the project admins. @pseudotensor ?

kanchanchy commented 1 year ago

@teju85 Which algorithm has been used to implement DBSCAN in cuML? CUDA-DClust or G-DBSCAN or something else?