rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Add KNN Sparse Output to Dask Layer #2547

Open aleksficek opened 4 years ago

aleksficek commented 4 years ago

Overview

Scikit-learn provides a kneighbors_graph feature that runs a k-nearest-neighbors query and returns the result as a sparse CSR matrix. This is being implemented for cuML in https://github.com/rapidsai/cuml/pull/2461, but supporting it in the Dask layer requires functionality that CuPy-backed sparse arrays do not yet have. That functionality is being added in https://github.com/cupy/cupy/pull/3486, on which kneighbors_graph in Dask depends, so it must be merged into CuPy beforehand (ETA: early August).
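To make the "KNN query returned as a sparse CSR matrix" idea concrete, here is a minimal pure-Python sketch of the output layout. It is not the cuML or Scikit-learn implementation: `knn_graph_csr` is a hypothetical helper, the brute-force search is for illustration only, and sorting column indices within each row is a choice made here for readability.

```python
import math

def knn_graph_csr(points, k):
    """Brute-force kNN graph as CSR arrays (data, indices, indptr).

    Hypothetical helper for illustration; row i holds the distances
    from point i to its k nearest neighbors (self excluded).
    """
    data, indices, indptr = [], [], [0]
    for i, p in enumerate(points):
        # Distance from point i to every other point.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        nearest = sorted(dists)[:k]
        # Store the row's entries ordered by column index.
        for d, j in sorted(nearest, key=lambda t: t[1]):
            indices.append(j)
            data.append(d)
        # indptr[i + 1] marks where row i ends in data/indices.
        indptr.append(len(indices))
    return data, indices, indptr

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
data, indices, indptr = knn_graph_csr(points, k=2)
print(indices)  # [1, 2, 0, 2, 0, 1, 1, 2]
print(indptr)   # [0, 2, 4, 6, 8]
```

Each row stores exactly k entries, so the graph for n points needs n * k values and column indices plus n + 1 row pointers, no matter how large the dense n x n matrix would be.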


github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] commented 3 years ago

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

vdet commented 2 years ago

The overview by aleksficek above covers the technical side. kneighbors_graph is much needed because KNN graphs have applications well beyond KNN classification or nearest-neighbor queries. For example, KNN graphs are routinely constructed to cluster high-dimensional single-cell RNA sequencing datasets, which requires access to the full KNN graph.

In Scikit-learn and cuML, kneighbors_graph returns the KNN graph needed to run, for example, community partitioning algorithms such as the Leiden algorithm. It is important to have this functionality in the Dask version, because GPU memory drastically limits the size of the graphs that can be constructed with the single-GPU version.
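As a rough illustration of why graph partitioning needs the whole KNN graph rather than per-query neighbor lists, the sketch below groups nodes by walking every edge of a CSR graph. Connected components via union-find are used here as a much simpler stand-in for Leiden, and `connected_components` is a hypothetical helper, not a cuML or Scikit-learn function.

```python
def connected_components(indices, indptr):
    """Label nodes of a CSR graph by connected component (edges treated as undirected)."""
    n = len(indptr) - 1
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Every stored edge (i, j) merges the two endpoints' components.
    for i in range(n):
        for j in indices[indptr[i]:indptr[i + 1]]:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

# Toy 1-NN graph: three mutually nearest pairs (0-1, 2-3, 4-5).
indices = [1, 0, 3, 2, 5, 4]
indptr = [0, 1, 2, 3, 4, 5, 6]
labels = connected_components(indices, indptr)
print(labels)  # [0, 0, 1, 1, 2, 2]
```

The result depends on all rows at once, which is why a distributed kneighbors_graph has to assemble a single sparse matrix across workers instead of returning independent per-partition results.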