Open zxygentoo opened 1 year ago
Thanks for filing this issue. It looks like we allow several metrics in the Cython layer but the C++ layer doesn't support them.
Would you be able to use the L2 distance for now? We'd also love to learn more about why you'd ideally use L1 vs another metric.
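For reference, a minimal sketch of that workaround (same synthetic data as the repro below; leaving the metric at its default "euclidean", i.e. L2, takes the supported code path):

import cuml
from sklearn.datasets import make_blobs

# Same data as the failing repro; "euclidean" (L2) is the default metric and
# is the distance the C++ layer currently supports, so this runs without the RAFT error.
X, _ = make_blobs(n_samples=10000, n_features=15, random_state=12)
clf = cuml.cluster.HDBSCAN(metric="euclidean")  # or simply omit metric
clf.fit(X)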
Minimal reproducible example:
import cuml
from sklearn.datasets import make_blobs
X, y = make_blobs(
    n_samples=10000,
    n_features=15,
    random_state=12
)
clf = cuml.cluster.HDBSCAN(metric="l1")
clf.fit(X)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[4], line 13
4 X, y = make_blobs(
5 n_samples=10000,
6 n_features=15,
7 random_state=12
8 )
10 clf = cuml.cluster.HDBSCAN(
11 metric="l1"
12 )
---> 13 clf.fit(X)
File /raid/nicholasb/miniconda3/envs/rapids-23.04-bertopic/lib/python3.10/site-packages/cuml/internals/api_decorators.py:188, in _make_decorator_function.<locals>.decorator_function.<locals>.decorator_closure.<locals>.wrapper(*args, **kwargs)
185 set_api_output_dtype(output_dtype)
187 if process_return:
--> 188 ret = func(*args, **kwargs)
189 else:
190 return func(*args, **kwargs)
File /raid/nicholasb/miniconda3/envs/rapids-23.04-bertopic/lib/python3.10/site-packages/cuml/internals/api_decorators.py:393, in enable_device_interop.<locals>.dispatch(self, *args, **kwargs)
391 if hasattr(self, "dispatch_func"):
392 func_name = gpu_func.__name__
--> 393 return self.dispatch_func(func_name, gpu_func, *args, **kwargs)
394 else:
395 return gpu_func(self, *args, **kwargs)
File /raid/nicholasb/miniconda3/envs/rapids-23.04-bertopic/lib/python3.10/site-packages/cuml/internals/api_decorators.py:190, in _make_decorator_function.<locals>.decorator_function.<locals>.decorator_closure.<locals>.wrapper(*args, **kwargs)
188 ret = func(*args, **kwargs)
189 else:
--> 190 return func(*args, **kwargs)
192 return cm.process_return(ret)
File base.pyx:665, in cuml.internals.base.UniversalBase.dispatch_func()
File hdbscan.pyx:847, in cuml.cluster.hdbscan.hdbscan.HDBSCAN.fit()
RuntimeError: RAFT failure at file=/opt/conda/conda-bld/work/cpp/src/hdbscan/detail/reachability.cuh line=280: Currently only L2 expanded distance is supported
Obtained 64 stack frames
...
cc @tarang-jain @cjnolet
Yeah, as the message suggests, HDBSCAN doesn't yet support L1 distance computation. We recently rewrote our brute-force primitives, so we now have what we need to support it while computing the mutual reachability, but we haven't implemented that support yet.
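For anyone following along, mutual reachability distance is defined as d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b)), where core_k(x) is the distance from x to its k-th nearest neighbor. A small CPU-side NumPy/SciPy sketch (illustrative only, not the cuML/RAFT code path) showing that the definition itself works with any metric, including L1/cityblock:

import numpy as np
from scipy.spatial.distance import cdist

def mutual_reachability(X, k=5, metric="cityblock"):
    # Pairwise distances under the chosen metric ("cityblock" == L1).
    D = cdist(X, X, metric=metric)
    # Core distance: distance to the k-th nearest neighbor
    # (index 0 after sorting is the point itself).
    core = np.sort(D, axis=1)[:, k]
    # d_mreach(a, b) = max(core(a), core(b), d(a, b))
    return np.maximum(np.maximum(core[:, None], core[None, :]), D)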
Thanks for the quick reply. Glad to hear this is making good progress.
As of now, any metric suggestions if I want to work with rather high-dimensional data?
@beckernick I've played with both L1 and L2 in the original implementation. It seems that as the dimensionality of the data increases, L1 produces better clustering results.
I'm new at this and the tests I've done aren't in any way scientific. Any suggestion is welcome, thanks!
In general, your experience matches my expectations for HDBSCAN on high-dimensional datasets.
With that said, in most BERTopic workflows I've seen, the UMAP output dimensionality is generally set between 5 and 15. The relative sparsity of the embedding space is less of an issue with fewer dimensions, so if you're using something similar to the standard BERTopic UMAP setup, I'd expect L2 to be reasonably effective.
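As a concrete sketch of that kind of setup (parameter values are illustrative, loosely following common BERTopic-style UMAP settings, and the data here is synthetic stand-in for document embeddings):

import cuml
from sklearn.datasets import make_blobs

# Stand-in for high-dimensional document embeddings.
embeddings, _ = make_blobs(n_samples=10000, n_features=384, random_state=12)

# Reduce to a handful of dimensions first, then cluster with the default L2 metric.
umap_model = cuml.manifold.UMAP(n_neighbors=15, n_components=5, min_dist=0.0)
reduced = umap_model.fit_transform(embeddings)

hdbscan_model = cuml.cluster.HDBSCAN(min_cluster_size=10)  # metric defaults to "euclidean"
labels = hdbscan_model.fit_predict(reduced)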
Regardless, we should not permit unusable distance functions in the Cython layer. I'll file a new issue, but if anyone is interested in making their first contribution to cuML, updating the Cython code to only permit using L2 would be a great first issue to tackle.
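For anyone picking this up, the change amounts to validating the metric before dispatching to C++. A hypothetical Python-level sketch (names are illustrative; the actual fix lives in cuML's Cython layer and may look different):

# Hypothetical guard; L2/"euclidean" is the only metric the C++ layer supports today.
SUPPORTED_METRICS = {"l2", "euclidean"}

def _validate_hdbscan_metric(metric: str) -> str:
    if metric.lower() not in SUPPORTED_METRICS:
        raise ValueError(
            f"metric={metric!r} is not yet supported by cuML HDBSCAN; "
            f"supported metrics are {sorted(SUPPORTED_METRICS)}."
        )
    return metric.lower()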
Describe the bug
HDBSCAN with the L1 metric fails with a "Currently only L2 expanded distance is supported" error.

Steps/Code to reproduce bug
Trying to speed up BERTopic by matching the original sklearn HDBSCAN implementation with a cuML one (roughly as sketched at the end of this post); the minimal reproducible example and error output are shown above.

Expected behavior
UMAP already switched to cuML and it works nicely without changes. From some digging into past issues, it seems we can expect the same with HDBSCAN?

Environment details (please complete the following information):

Additional context
The error persists when setting gen_min_span_tree and prediction_data both to False. I'm new to cuML; if any further steps are needed to produce a more minimal error case, please kindly point them out.

Thanks, Alex
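Roughly what the attempted swap looks like (a sketch assuming BERTopic's umap_model / hdbscan_model constructor arguments; not the exact code from the failing run):

from bertopic import BERTopic
import cuml

# Drop-in cuML estimators for BERTopic's dimensionality-reduction and clustering steps.
umap_model = cuml.manifold.UMAP(n_neighbors=15, n_components=5, min_dist=0.0)
hdbscan_model = cuml.cluster.HDBSCAN(min_cluster_size=10)  # passing metric="l1" here triggers the error above

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
# topics, probs = topic_model.fit_transform(docs)  # docs: a list of raw documents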