rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Using L1 metric with HDBSCAN failed with 'Currently only L2 expanded distance is supported' error #5414

Open zxygentoo opened 1 year ago

zxygentoo commented 1 year ago

Describe the bug Using HDBSCAN with the L1 metric fails with the error 'Currently only L2 expanded distance is supported'.

Steps/Code to reproduce bug

I'm trying to speed up BERTopic by swapping the original sklearn HDBSCAN implementation for the cuML one:

from cuml.cluster import HDBSCAN

hdbscan_model = HDBSCAN(
    verbose=verbose,
    min_samples=DEFUALT_MIN_SAMPLES,
    min_cluster_size=min_topic_size,
    metric=hdbscan_metric, # hdbscan_metric = "l1"
    gen_min_span_tree=True,
    prediction_data=True
)

# ... omitted

model = BERTopic(
    verbose=verbose,
    calculate_probabilities=calculate_probabilities,
    low_memory=low_memory,
    top_n_words=top_n_words,
    embedding_model=sentence_model,
    vectorizer_model=vectorizer,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model
)

model.fit(sentences, embeddings)

Error output:

Traceback (most recent call last):
  File "/home/ubuntu/channel-poc/fit.py", line 255, in <module>
    fit()
  File "/home/ubuntu/venv/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/channel-poc/fit.py", line 244, in fit
    _fit(
  File "/home/ubuntu/channel-poc/fit.py", line 190, in _fit
    model.fit(sentences, embeddings)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/bertopic/_bertopic.py", line 280, in fit
    self.fit_transform(documents, embeddings, y)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/bertopic/_bertopic.py", line 359, in fit_transform
    documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/bertopic/_bertopic.py", line 2903, in _cluster_embeddings
    self.hdbscan_model.fit(umap_embeddings, y=y)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/cuml/internals/api_decorators.py", line 393, in dispatch
    return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/cuml/internals/api_decorators.py", line 190, in wrapper
    return func(*args, **kwargs)
  File "base.pyx", line 665, in cuml.internals.base.UniversalBase.dispatch_func
  File "hdbscan.pyx", line 847, in cuml.cluster.hdbscan.hdbscan.HDBSCAN.fit
RuntimeError: RAFT failure at file=/project/cpp/src/hdbscan/detail/reachability.cuh line=280: Currently only L2 expanded distance is supported
Obtained 53 stack frames
#0 in /home/ubuntu/venv/lib/python3.10/site-packages/cuml/internals/../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x38) [0x7f1b5b169d28]
#1 in /home/ubuntu/venv/lib/python3.10/site-packages/cuml/internals/../libcuml++.so(_ZN4raft11logic_errorC2ERKSs+0x38) [0x7f1b5b16a478]
#2 in /home/ubuntu/venv/lib/python3.10/site-packages/cuml/internals/../libcuml++.so(_ZN2ML7HDBSCAN6detail12Reachability25mutual_reachability_graphIifEEvRKN4raft8handle_tEPKT0_mmNS4_8distance12DistanceTypeEiS8_PT_PS8_RNS4_6sparse6detail3COOIS8_SD_EE+0xb09) [0x7f1b5b71e879]
#3 in /home/ubuntu/venv/lib/python3.10/site-packages/cuml/internals/../libcuml++.so(_ZN2ML7HDBSCAN13build_linkageIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsEPS6_RNSB_28robust_single_linkage_outputIT_S6_EE+0x5cc) [0x7f1b5b75380c]
#4 in /home/ubuntu/venv/lib/python3.10/site-packages/cuml/internals/../libcuml++.so(_ZN2ML7HDBSCAN12_fit_hdbscanIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsEPT_PS6_RNSB_14hdbscan_outputISE_S6_EE+0xf7) [0x7f1b5b754397]
#5 in /home/ubuntu/venv/lib/python3.10/site-packages/cuml/internals/../libcuml++.so(_ZN2ML7hdbscanERKN4raft8handle_tEPKfmmNS0_8distance12DistanceTypeERNS_7HDBSCAN6Common13HDBSCANParamsERNS9_14hdbscan_outputIifEEPf+0x23b) [0x7f1b5b69363b]
#6 in /home/ubuntu/venv/lib/python3.10/site-packages/cuml/cluster/hdbscan/hdbscan.cpython-310-x86_64-linux-gnu.so(+0x79337) [0x7f1ad6db1337]
#7 in /home/ubuntu/venv/lib/python3.10/site-packages/cuml/internals/base.cpython-310-x86_64-linux-gnu.so(+0x2a6a8) [0x7f1b512ec6a8]
#8 in python3(PyObject_Call+0xbb) [0x5615bf74887b]
#9 in python3(_PyEval_EvalFrameDefault+0x2a40) [0x5615bf724af0]
#10 in python3(+0x16ac31) [0x5615bf747c31]
#11 in python3(PyObject_Call+0x122) [0x5615bf7488e2]
#12 in python3(_PyEval_EvalFrameDefault+0x2a40) [0x5615bf724af0]
#13 in python3(_PyFunction_Vectorcall+0x7c) [0x5615bf73a1ec]
#14 in python3(PyObject_Call+0x122) [0x5615bf7488e2]
#15 in python3(_PyEval_EvalFrameDefault+0x2a40) [0x5615bf724af0]
#16 in python3(+0x16ac31) [0x5615bf747c31]
#17 in python3(_PyEval_EvalFrameDefault+0x1a2f) [0x5615bf723adf]
#18 in python3(+0x16ac31) [0x5615bf747c31]
#19 in python3(_PyEval_EvalFrameDefault+0x1a2f) [0x5615bf723adf]
#20 in python3(_PyFunction_Vectorcall+0x7c) [0x5615bf73a1ec]
#21 in python3(_PyEval_EvalFrameDefault+0x81b) [0x5615bf7228cb]
#22 in python3(_PyFunction_Vectorcall+0x7c) [0x5615bf73a1ec]
#23 in python3(_PyEval_EvalFrameDefault+0x81b) [0x5615bf7228cb]
#24 in python3(_PyFunction_Vectorcall+0x7c) [0x5615bf73a1ec]
#25 in python3(_PyEval_EvalFrameDefault+0x6d5) [0x5615bf722785]
#26 in python3(_PyFunction_Vectorcall+0x7c) [0x5615bf73a1ec]
#27 in python3(PyObject_Call+0x122) [0x5615bf7488e2]
#28 in python3(_PyEval_EvalFrameDefault+0x2a40) [0x5615bf724af0]
#29 in python3(+0x16ac31) [0x5615bf747c31]
#30 in python3(PyObject_Call+0x122) [0x5615bf7488e2]
#31 in python3(_PyEval_EvalFrameDefault+0x2a40) [0x5615bf724af0]
#32 in python3(_PyFunction_Vectorcall+0x7c) [0x5615bf73a1ec]
#33 in python3(_PyEval_EvalFrameDefault+0x81b) [0x5615bf7228cb]
#34 in python3(+0x16ae91) [0x5615bf747e91]
#35 in python3(_PyEval_EvalFrameDefault+0x2a40) [0x5615bf724af0]
#36 in python3(_PyObject_FastCallDictTstate+0xc4) [0x5615bf72f634]
#37 in python3(_PyObject_Call_Prepend+0x5c) [0x5615bf744cac]
#38 in python3(+0x285610) [0x5615bf862610]
#39 in python3(_PyObject_MakeTpCall+0x25b) [0x5615bf7304ab]
#40 in python3(_PyEval_EvalFrameDefault+0x6765) [0x5615bf728815]
#41 in python3(+0x141ed6) [0x5615bf71eed6]
#42 in python3(PyEval_EvalCode+0x86) [0x5615bf815366]
#43 in python3(+0x265108) [0x5615bf842108]
#44 in python3(+0x25df5b) [0x5615bf83af5b]
#45 in python3(+0x264e55) [0x5615bf841e55]
#46 in python3(_PyRun_SimpleFileObject+0x1a8) [0x5615bf841338]
#47 in python3(_PyRun_AnyFileObject+0x43) [0x5615bf841033]
#48 in python3(Py_RunMain+0x2be) [0x5615bf8322de]
#49 in python3(Py_BytesMain+0x2d) [0x5615bf80832d]
#50 in /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f1cc3a29d90]
#51 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f1cc3a29e40]
#52 in python3(_start+0x25) [0x5615bf808225]

Expected behavior I have already switched UMAP to cuML and it works nicely without any changes. From digging through past issues, it seems the same is expected of HDBSCAN?

Environment details (please complete the following information):

Additional context The error persists when setting gen_min_span_tree and prediction_data both to False.

I'm new to cuML; if any further steps are needed to produce a more minimal error case, please kindly point them out.

Thanks, Alex

beckernick commented 1 year ago

Thanks for filing this issue. It looks like we allow several metrics in the Cython layer but the C++ layer doesn't support them.

Would you be able to use the L2 distance for now? We'd also love to learn more about why you'd ideally use L1 vs another metric.

Minimal reproducible example:

import cuml
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=10000,
    n_features=15,
    random_state=12
)

clf = cuml.cluster.HDBSCAN(metric="l1")
clf.fit(X)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 13
      4 X, y = make_blobs(
      5     n_samples=10000,
      6     n_features=15,
      7     random_state=12
      8 )
     10 clf = cuml.cluster.HDBSCAN(
     11     metric="l1"
     12 )
---> 13 clf.fit(X)

File /raid/nicholasb/miniconda3/envs/rapids-23.04-bertopic/lib/python3.10/site-packages/cuml/internals/api_decorators.py:188, in _make_decorator_function.<locals>.decorator_function.<locals>.decorator_closure.<locals>.wrapper(*args, **kwargs)
    185     set_api_output_dtype(output_dtype)
    187 if process_return:
--> 188     ret = func(*args, **kwargs)
    189 else:
    190     return func(*args, **kwargs)

File /raid/nicholasb/miniconda3/envs/rapids-23.04-bertopic/lib/python3.10/site-packages/cuml/internals/api_decorators.py:393, in enable_device_interop.<locals>.dispatch(self, *args, **kwargs)
    391 if hasattr(self, "dispatch_func"):
    392     func_name = gpu_func.__name__
--> 393     return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
    394 else:
    395     return gpu_func(self, *args, **kwargs)

File /raid/nicholasb/miniconda3/envs/rapids-23.04-bertopic/lib/python3.10/site-packages/cuml/internals/api_decorators.py:190, in _make_decorator_function.<locals>.decorator_function.<locals>.decorator_closure.<locals>.wrapper(*args, **kwargs)
    188         ret = func(*args, **kwargs)
    189     else:
--> 190         return func(*args, **kwargs)
    192 return cm.process_return(ret)

File base.pyx:665, in cuml.internals.base.UniversalBase.dispatch_func()

File hdbscan.pyx:847, in cuml.cluster.hdbscan.hdbscan.HDBSCAN.fit()

RuntimeError: RAFT failure at file=/opt/conda/conda-bld/work/cpp/src/hdbscan/detail/reachability.cuh line=280: Currently only L2 expanded distance is supported
Obtained 64 stack frames
...

cc @tarang-jain @cjnolet

cjnolet commented 1 year ago

Yeah, as the message suggests, HDBSCAN doesn't yet support L1 distance computation. We recently re-wrote our brute-force so that we now have the primitives to support it while computing the mutual reachability, but we haven't yet implemented the support for it.
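
For readers unfamiliar with the term, mutual reachability can be sketched in a few lines of plain Python. This is only an illustration of the standard HDBSCAN definition, not cuML's or RAFT's implementation; the function names are hypothetical, and the core distance here excludes the point itself when counting neighbors (implementations differ on that convention).

```python
from typing import Callable, List, Sequence

Point = Sequence[float]
Metric = Callable[[Point, Point], float]


def l1(a: Point, b: Point) -> float:
    """Manhattan (L1) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2(a: Point, b: Point) -> float:
    """Euclidean (L2) distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def core_distances(points: List[Point], k: int, metric: Metric) -> List[float]:
    """Distance from each point to its k-th nearest neighbor (self excluded)."""
    cores = []
    for i, p in enumerate(points):
        dists = sorted(metric(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[k - 1])
    return cores


def mutual_reachability(points: List[Point], k: int, metric: Metric) -> List[List[float]]:
    """Dense matrix of max(core(a), core(b), d(a, b)); the diagonal is 0."""
    cores = core_distances(points, k, metric)
    n = len(points)
    return [
        [
            0.0 if i == j else max(cores[i], cores[j], metric(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy data: three nearby points and one outlier.
points = [(0, 0), (1, 0), (0, 1), (5, 5)]
mr_l1 = mutual_reachability(points, k=2, metric=l1)
mr_l2 = mutual_reachability(points, k=2, metric=l2)
```

The metric enters twice, once in the core distances and once in the pairwise term, so supporting L1 means both steps need an L1-capable primitive.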

zxygentoo commented 1 year ago

Thanks for the quick reply. Glad to hear this is making good progress.

For now, are there any metric suggestions if I want to work on rather high-dimensional data?

zxygentoo commented 1 year ago

@beckernick I've played with both L1 and L2 in the original implementation. It seems that as the dimensionality of the data increases, L1 produces better clustering results.

I'm new at this and the tests I've done aren't in any way scientific. Any suggestions are welcome, thanks!

beckernick commented 1 year ago

In general, your experience matches my expectations for HDBSCAN on high-dimensionality datasets.

With that said, in most BERTopic workflows I've seen, the UMAP parameters are generally set between 5 and 15. The relative sparsity of the embedding space is less of an issue with fewer dimensions, so if you're using something similar to the standard BERTopic UMAP setup I'd think L2 is likely reasonably effective.
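
A rough way to see the dimensionality effect is to measure how much the nearest and farthest distances from a query differ as the number of dimensions grows. The sketch below uses only the standard library and uniform random data, so it says nothing about real embeddings; `relative_contrast` is a hypothetical helper, and 384 is just an arbitrary "large" dimension chosen for illustration.

```python
import random


def minkowski(a, b, p):
    """Minkowski distance: p=1 is L1 (Manhattan), p=2 is L2 (Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def relative_contrast(n_points, n_dims, p, seed=0):
    """(d_max - d_min) / d_min for distances from one random query point
    to n_points random points drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        minkowski(query, [rng.random() for _ in range(n_dims)], p)
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


# Compare a UMAP-like low dimension with larger ones, for both metrics.
for dims in (5, 15, 384):
    c1 = relative_contrast(200, dims, p=1)
    c2 = relative_contrast(200, dims, p=2)
    print(f"dims={dims:4d}  L1 contrast={c1:.3f}  L2 contrast={c2:.3f}")
```

On data like this the contrast shrinks as the dimension grows for both metrics, which is why the choice of metric tends to matter less after reducing to a handful of dimensions.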

Regardless, we should not permit unusable distance functions in the Cython layer. I'll file a new issue, but if anyone is interested in making their first contribution to cuML, updating the Cython code to only permit using L2 would be a great first issue to tackle.
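
As a starting point for anyone picking this up, the check could look roughly like the following. This is a plain-Python sketch rather than the actual Cython code: `SUPPORTED_METRICS` and `validate_metric` are hypothetical names, and the exact set of accepted spellings is an assumption that would need to match what the C++ layer really supports.

```python
# Assumption: only L2 / Euclidean is usable end to end today.
SUPPORTED_METRICS = ("euclidean", "l2")


def validate_metric(metric: str) -> str:
    """Return the normalized metric name, or fail early with a clear message
    instead of surfacing a RAFT failure from the C++ layer during fit()."""
    normalized = metric.lower()
    if normalized not in SUPPORTED_METRICS:
        raise ValueError(
            f"metric={metric!r} is not supported; "
            f"expected one of {SUPPORTED_METRICS}"
        )
    return normalized
```

Calling `validate_metric("l1")` would then raise a `ValueError` at construction or at the start of `fit`, before any GPU work is launched.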