rapidsai / cuml

cuML - RAPIDS Machine Learning Library
Apache License 2.0

[FEA] "Precomputed" Distance Matrix in (some) Clustering Algorithms #4516

Open Mortom123 opened 2 years ago

Mortom123 commented 2 years ago

Sometimes we do not have point representations in space, only the distances between points. It would therefore be great if some algorithms (I'm especially interested in HDBSCAN and Agglomerative Clustering) could work on precomputed (sparse) distance matrices, similar to the "precomputed" metric available in many sklearn algorithms.

Personally, I'm working with biological structural data, so I only have differences in structure, not points in space.

Several other issues also relate to this FEA: #4475, #4460 (#1192, #4409). For DBSCAN, for example, this was already implemented in issue #3302.

erke-apoqlar commented 2 years ago


Are there any updates on making precomputed matrices available for HDBSCAN?

SnzFor16Min commented 9 months ago

Just tried to run HDBSCAN on a cupyx.scipy.sparse._csr.csr_matrix, and it immediately rejected the sparse input:

  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 393, in dispatch
    return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
  File "base.pyx", line 687, in cuml.internals.base.UniversalBase.dispatch_func
  File "hdbscan.pyx", line 762, in cuml.cluster.hdbscan.hdbscan.HDBSCAN.fit
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/input_utils.py", line 380, in input_to_cuml_array
    arr = CumlArray.from_input(
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/array.py", line 1114, in from_input
    arr = cls(X, index=index, order=requested_order, validate=False)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/array.py", line 292, in __init__
    new_data = cur_xpy.asarray(data, dtype=dtype)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cupy/_creation/from_data.py", line 88, in asarray
    return _core.array(a, dtype, False, order, blocking=blocking)
  File "cupy/_core/core.pyx", line 2379, in cupy._core.core.array
  File "cupy/_core/core.pyx", line 2406, in cupy._core.core.array
  File "cupy/_core/core.pyx", line 2541, in cupy._core.core._array_default
ValueError: setting an array element with a sequence.

I noticed there is a SparseCumlArray class, but surprisingly HDBSCAN does not accept it either:

  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 393, in dispatch
    return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
  File "base.pyx", line 687, in cuml.internals.base.UniversalBase.dispatch_func
  File "hdbscan.pyx", line 762, in cuml.cluster.hdbscan.hdbscan.HDBSCAN.fit
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/input_utils.py", line 380, in input_to_cuml_array
    arr = CumlArray.from_input(
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/array.py", line 1114, in from_input
    arr = cls(X, index=index, order=requested_order, validate=False)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cuml/internals/array.py", line 292, in __init__
    new_data = cur_xpy.asarray(data, dtype=dtype)
  File "/U_2021PZZZJC0001/jiaxin.guo/miniforge3/envs/py311/lib/python3.11/site-packages/cupy/_creation/from_data.py", line 88, in asarray
    return _core.array(a, dtype, False, order, blocking=blocking)
  File "cupy/_core/core.pyx", line 2379, in cupy._core.core.array
  File "cupy/_core/core.pyx", line 2406, in cupy._core.core.array
  File "cupy/_core/core.pyx", line 2541, in cupy._core.core._array_default
TypeError: float() argument must be a string or a real number, not 'SparseCumlArray'

Looking forward to any suggestions or a support timeline for this, since precomputed sparse distance matrices are common in clustering workflows.
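To illustrate the kind of input meant here, below is a small CPU-side sketch using scikit-learn's DBSCAN, which accepts a sparse matrix with `metric="precomputed"` and only considers stored entries as potential neighbours. This is not a cuML workaround, and the six-point matrix is invented for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# Hypothetical sparse distance matrix for six items: only "close" pairs are
# stored, and every missing entry is treated as "not a neighbour".
rows = np.array([0, 1, 1, 2, 3, 4, 4, 5])
cols = np.array([1, 0, 2, 1, 4, 3, 5, 4])
dists = np.ones(8)
D_sparse = csr_matrix((dists, (rows, cols)), shape=(6, 6))

# With metric="precomputed", X is read as an (n_samples, n_samples) matrix
# of distances rather than a feature matrix.
db = DBSCAN(eps=1.5, min_samples=2, metric="precomputed").fit(D_sparse)
labels = db.labels_
print(labels)
```

Items 0 to 2 and items 3 to 5 form two chains with no stored distance between them, so they end up in separate clusters.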

KanishkT123 commented 2 months ago

@cjnolet, what would it take to get this implemented and merged? I'm happy to take a shot at it, with no promises as to how far I'll get. I'm working with some data right now that would very much benefit from 'cosine', and failing that, 'precomputed' would be a good way to get a lot of different metrics working.

I would just need some guidance on where to start.
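As a side note on 'cosine': when only a Euclidean metric is available, a commonly used stopgap is to L2-normalize the rows first, because for unit vectors the squared Euclidean distance equals twice the cosine distance. The NumPy sketch below only checks that identity on made-up data. It assumes no zero-length rows, and because density-based methods use the actual distance values, results need not match a true cosine run.

```python
import numpy as np

# Hypothetical data: 5 samples with 3 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# L2-normalize each row so every sample has unit length.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Cosine distance = 1 - cosine similarity. This dense matrix is also what
# would be passed to an estimator that supports metric="precomputed".
cos_dist = 1.0 - X_unit @ X_unit.T

# Squared Euclidean distance between the normalized rows.
diff = X_unit[:, None, :] - X_unit[None, :, :]
sq_euclid = np.sum(diff**2, axis=-1)

# For unit vectors: ||a - b||^2 = 2 * (1 - a.b)
print(np.allclose(sq_euclid, 2.0 * cos_dist))
```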