rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] terminate called after throwing an instance of 'raft::cuda_error' #4474

Open ztf-ucas opened 2 years ago

ztf-ucas commented 2 years ago

Hi, I'm using cuml.HDBSCAN and encountered the following problem:

terminate called after throwing an instance of 'raft::cuda_error'
  what():  CUDA error encountered at: file=_deps/raft-src/cpp/include/raft/cudart_utils.h line=267: call='cudaMemcpyAsync(d_ptr1, d_ptr2, len * sizeof(Type), cudaMemcpyDeviceToDevice, stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 32 stack frames

0 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x46) [0x7f1bd4f95056]

1 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft10cuda_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xc9) [0x7f1bd4f95e39]

2 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft10copy_asyncIiEEvPT_PKS1_mN3rmm16cuda_stream_viewE+0x138) [0x7f1bd522f948]

3 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft9hierarchy6detail16build_sorted_mstIifN2ML7HDBSCAN22FixConnectivitiesRedOpIifEEEEvRKNS_8handle_tEPKT0_PKT_SF_SC_mmPSD_SG_PSA_SG_mT1_NS_8distance12DistanceTypeEi+0x4c2) [0x7f1bd527e942]

4 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN2ML7HDBSCAN13build_linkageIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsERNSB_28robust_single_linkage_outputIT_S6_EE+0x372) [0x7f1bd5281512]

5 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN2ML7hdbscanERKN4raft8handle_tEPKfmmNS0_8distance12DistanceTypeERNS_7HDBSCAN6Common13HDBSCANParamsERNS9_14hdbscan_outputIifEE+0x7e) [0x7f1bd521759e]

6 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x43ec2) [0x7f1de251cec2]

7 in python(PyObject_Call+0x24d) [0x56056760d35d]

8 in python(_PyEval_EvalFrameDefault+0x21bf) [0x5605676b64ef]

9 in python(_PyEval_EvalCodeWithName+0x2c3) [0x560567696db3]

10 in python(PyEval_EvalCodeEx+0x39) [0x560567697e19]

11 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x2c298) [0x7f1de2505298]

12 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x2c4f9) [0x7f1de25054f9]

13 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x3c072) [0x7f1de2515072]

14 in python(PyObject_Call+0x24d) [0x56056760d35d]

15 in python(_PyEval_EvalFrameDefault+0x21bf) [0x5605676b64ef]

16 in python(_PyEval_EvalCodeWithName+0x2c3) [0x560567696db3]

17 in python(+0x1b08b7) [0x5605676988b7]

18 in python(_PyEval_EvalFrameDefault+0x4e03) [0x5605676b9133]

19 in python(_PyFunction_Vectorcall+0x1a6) [0x560567697fc6]

20 in python(_PyEval_EvalFrameDefault+0x947) [0x5605676b4c77]

21 in python(_PyEval_EvalCodeWithName+0x2c3) [0x560567696db3]

22 in python(PyEval_EvalCodeEx+0x39) [0x560567697e19]

23 in python(PyEval_EvalCode+0x1b) [0x56056773a24b]

24 in python(+0x2522e3) [0x56056773a2e3]

25 in python(+0x26e543) [0x560567756543]

26 in python(+0x273562) [0x56056775b562]

27 in python(PyRun_SimpleFileExFlags+0x1b2) [0x56056775b742]

28 in python(Py_RunMain+0x36d) [0x56056775bcbd]

29 in python(Py_BytesMain+0x39) [0x56056775be79]

30 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f1fc17a4b97]

31 in python(+0x1e6d69) [0x5605676ced69]

Aborted (core dumped)

cjnolet commented 2 years ago

Thanks for opening an issue about this @ztf-ucas. To isolate the cause of this failure, it would be helpful if you can provide a code snippet that we can use to reproduce it. It would also be useful to provide the dataset (or relevant details) if you are able.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Brillone commented 2 years ago

Hi, the same issue is happening to me with certain settings of the HDBSCAN model (some settings work).

For example, it fails with the following parameters: model = HDBSCAN(min_cluster_size=15, min_samples=10)

A setting that did work: model = HDBSCAN(min_cluster_size=5, min_samples=5)

My dataset has 2.5M samples with 64 dimensions (I can't provide the dataset).

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

preet2312 commented 1 year ago

@cjnolet @Brillone Hello, I encountered the same issue: some parameter combinations work and some throw the same error when run as a Python script. (If I run it in a Jupyter notebook in VS Code, it instead gives an error like: FutureWarning: Supporting extra quotes around strings is deprecated in traitlets 5.0. You can use 'hmac-sha256' instead of '"hmac-sha256"' if you require traitlets >=5.)

I am using 4.7M samples with a dimensionality of 50.

For example:
Doesn't work: model = HDBSCAN(min_cluster_size=1000, min_samples=10)
Works: model = HDBSCAN(min_cluster_size=1000, min_samples=5)

Please let me know if there's any update.

Thanks.

mayurgd commented 1 year ago

@cjnolet Hi, I'm facing this error while executing HDBSCAN:

terminate called after throwing an instance of 'raft::cuda_error'
what():  CUDA error encountered at: file=_deps/raft-src/cpp/include/raft/cudart_utils.h 
line=267: call='cudaMemcpyAsync(d_ptr1, d_ptr2, len * sizeof(Type),cudaMemcpyDeviceToDevice, stream)', 
Reason=cudaErrorInvalidValue:invalid argument

Below is reproducible code that gives the error for me:

import numpy as np
import pandas as pd
from cuml.cluster import HDBSCAN as HDBSCAN_gpu

X = np.array([[-14.01115608,  -5.37217331, 314.        ],
       [-17.31538773,  -6.12932587,  22.        ],
       [-17.88701439,  -7.00569153,  16.        ],
       [-17.91534615,  -7.40659523,  12.        ],
       [-13.57449722,  -3.70668411,  12.        ],
       [-14.97053146,  -6.00550461,  51.        ],
       [-15.5725193 ,  -5.07519722,   2.        ],
       [-13.31140137,  -3.99990654,   5.        ],
       [-13.84429169,  -4.01345634,   1.        ],
       [-17.02877998,  -6.42786789,  46.        ],
       [-15.09358597,  -5.4496851 ,  22.        ],
       [-17.52828217,  -6.86034393,   4.        ],
       [-15.57351112,  -5.61835861,   4.        ],
       [-14.20898056,  -4.61386681,   8.        ],
       [-14.45912552,  -5.47292137,   1.        ],
       [-15.27561951,  -4.74104977,   1.        ]])
test = pd.DataFrame(X, columns=['x','y','repeat'])
test = test.loc[test.index.repeat(test.repeat)].drop(columns='repeat')
hdb = HDBSCAN_gpu(
                min_samples=10,
                min_cluster_size=15,
                cluster_selection_method="eom",
                metric="euclidean",
                gen_min_span_tree=True,
            )

labels = hdb.fit_predict(test)

The HDBSCAN model runs without any error for min_samples < 5; anything greater than or equal to 5 gives raft::cuda_error (cuml version '22.02.00').

beckernick commented 1 year ago

We've made a variety of updates to HDBSCAN since v22.02. Does this error present if you use cuML 23.02?

mayurgd commented 1 year ago

@beckernick thanks for the response. I am in the process of upgrading my RAPIDS Docker image to version 23.02 and will update once that is done. One question about HDBSCAN, though: should duplicate rows be removed before applying HDBSCAN, or should it be applied to the data with duplicate rows? For example, in the code snippet above, should it be applied to X (the non-duplicated array) or to test (the duplicated DataFrame)?

beckernick commented 1 year ago

Duplicates can to some extent be seen as sample weights, and removing them might move your analysis farther away from the underlying ground-truth data distribution from which your data is implicitly sampled. I'd probably leave them in.

beckernick commented 1 year ago

@preet2312 , do you have any information about your environment (library versions) and system platforms with which you experienced this issue?

mayurgd commented 1 year ago

@beckernick I updated RAPIDS to v23.02 using the rapidsai/rapidsai-core:23.02-cuda11.2-runtime-ubuntu20.04-py3.8 image. I still get 'raft::cuda_error' for the above-mentioned example.

Error logs: the Databricks driver logs show

terminate called after throwing an instance of 'raft::cuda_error' what(): CUDA error encountered at: file=/databricks/conda/envs/rapids/include/raft/util/cudart_utils.hpp line=278:

and the Databricks notebook shows

ConnectException: Connection refused (Connection refused) Error while obtaining a new communication channel ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.

beckernick commented 1 year ago

Thanks for testing in 23.02 and creating a minimal reproducible example. I can reproduce this behavior.

The underlying error appears to be that a single-linkage solution can't be found in at least some scenarios, and this error is not caught and propagated back up to Python.

With REPS = 10000 I can reproduce this consistently. With smaller REPS, I can reproduce it intermittently.

import numpy as np
from cuml.cluster import HDBSCAN

REPS = 10000

X = np.arange(12)
tiled = np.tile(X, REPS).reshape(-1, 3)

clusterer = HDBSCAN()
clusterer.fit(tiled)
terminate called after throwing an instance of 'raft::logic_error'
  what():  RAFT failure at file=/opt/conda/conda-bld/work/cpp/src/hdbscan/detail/condense.cuh line=88: Multiple components found in MST or MST is invalid. Cannot find single-linkage solution. Found 79997 vertices total.
Obtained 56 stack frames
#0 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x3b) [0x7fe0782bfb8b]
#1 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft11logic_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd) [0x7fe0782c040d]
#2 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML7HDBSCAN6detail8Condense25build_condensed_hierarchyIifLi256EEEvRKN4raft8handle_tEPKT_PKT0_SA_iiRNS0_6Common18CondensedHierarchyIS8_SB_EE+0x10f6) [0x7fe07881a936]
#3 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML7HDBSCAN12_fit_hdbscanIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsEPT_PS6_RNSB_14hdbscan_outputISE_S6_EE+0x1d5) [0x7fe078835195]
#4 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML7hdbscanERKN4raft8handle_tEPKfmmNS0_8distance12DistanceTypeERNS_7HDBSCAN6Common13HDBSCANParamsERNS9_14hdbscan_outputIifEEPf+0x246) [0x7fe078750706]
#5 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/cluster/hdbscan/hdbscan.cpython-310-x86_64-linux-gnu.so(+0x74f2a) [0x7fdf60510f2a]
#6 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/base.cpython-310-x86_64-linux-gnu.so(+0x1c35f) [0x7fdf6096135f]
#7 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyObject_Call+0x209) [0x55e944f23209]
#8 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#9 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c6e1) [0x55e944f226e1]
#10 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#11 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#12 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#13 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c581) [0x55e944f22581]
#14 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x4d0d) [0x55e944f0b23d]
#15 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1db6a2) [0x55e944fb16a2]
#16 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyEval_EvalCode+0x87) [0x55e944fb15e7]
#17 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e2c30) [0x55e944fb8c30]
#18 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x140d14) [0x55e944f16d14]
#19 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x332) [0x55e944f06862]
#20 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e530d) [0x55e944fbb30d]
#21 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x1bb1) [0x55e944f080e1]
#22 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e530d) [0x55e944fbb30d]
#23 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x1bb1) [0x55e944f080e1]
#24 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e530d) [0x55e944fbb30d]
#25 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1faada) [0x55e944fd0ada]
#26 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14b41f) [0x55e944f2141f]
#27 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#28 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#29 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x332) [0x55e944f06862]
#30 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#31 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#32 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c581) [0x55e944f22581]
#33 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x13d0) [0x55e944f07900]
#34 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#35 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#36 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#37 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#38 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#39 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#40 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c581) [0x55e944f22581]
#41 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyObject_Call+0xb8) [0x55e944f230b8]
#42 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#43 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#44 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x332) [0x55e944f06862]
#45 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1db6a2) [0x55e944fb16a2]
#46 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyEval_EvalCode+0x87) [0x55e944fb15e7]
#47 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x20e3fc) [0x55e944fe43fc]
#48 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x2092d4) [0x55e944fdf2d4]
#49 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x9758d) [0x55e944e6d58d]
#50 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyRun_SimpleFileObject+0x1b5) [0x55e944fd94f5]
#51 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyRun_AnyFileObject+0x43) [0x55e944fd90a3]
#52 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(Py_RunMain+0x399) [0x55e944fd6279]
#53 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(Py_BytesMain+0x39) [0x55e944fa3dc9]
#54 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fe0fe28d083]
#55 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1cdcc1) [0x55e944fa3cc1]

Aborted (core dumped)

cc @cjnolet @tarang-jain @divyegala , as you may have looked at this code recently

cjnolet commented 1 year ago

@beckernick I think the error about convergence is generally not likely to happen in practice. It looks like the duplicated rows are causing the MST to disregard additional edges. If that case does end up becoming a showstopper on real datasets, we should definitely figure out a way around it. However, the error you are receiving explains what's going on: there's just not enough information to connect the graph because of the duplicated edges, and we need a connected graph in order to build the dendrogram.

I slightly tweaked the input and was able to reproduce the originally reported error. I do think we should investigate this one further (cc @tarang-jain, who is looking into this):

>>> import numpy as np
>>> from cuml.cluster import HDBSCAN

>>> 
>>> 
>>> REPS = 10000
>>> X = np.arange(500)
>>> tiled = np.tile(X, REPS).reshape(-1, 3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: cannot reshape array of size 5000000 into shape (3)
>>> tiled = np.tile(X, REPS).reshape(-1, 10)
>>> clusterer = HDBSCAN()
>>> clusterer.fit(tiled)
terminate called after throwing an instance of 'raft::cuda_error'
  what():  CUDA error encountered at: file=/home/cjnolet/miniconda3/envs/cuml_2304_032323/include/raft/util/cudart_utils.hpp line=244: call='cudaMemcpyAsync(d_ptr1, d_ptr2, len * sizeof(Type), cudaMemcpyDeviceToDevice, stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 29 stack frames
#0 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x84) [0x7f02ed253f84]
#1 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd) [0x7f02ed2549dd]
#2 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN4raft10copy_asyncIiEEvPT_PKS1_mN3rmm16cuda_stream_viewE+0x19a) [0x7f02ed6f7dfa]
#3 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN2ML7HDBSCAN13build_linkageIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsEPS6_RNSB_28robust_single_linkage_outputIT_S6_EE+0x19fa) [0x7f02ed7aa64a]
#4 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN2ML7HDBSCAN12_fit_hdbscanIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsEPT_PS6_RNSB_14hdbscan_outputISE_S6_EE+0xf1) [0x7f02ed7abdc1]
#5 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN2ML7hdbscanERKN4raft8handle_tEPKfmmNS0_8distance12DistanceTypeERNS_7HDBSCAN6Common13HDBSCANParamsERNS9_14hdbscan_outputIifEEPf+0x25a) [0x7f02ed6d87fa]
#6 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/python3.10/site-packages/cuml/cluster/hdbscan/hdbscan.cpython-310-x86_64-linux-gnu.so(+0x75f7d) [0x7f01c212df7d]
#7 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/python3.10/site-packages/cuml/internals/base.cpython-310-x86_64-linux-gnu.so(+0x248ea) [0x7f01c3fe98ea]
#8 in python(PyObject_Call+0x209) [0x55f302743139]
#9 in python(_PyEval_EvalFrameDefault+0x2ec2) [0x55f302729cb2]
#10 in python(+0x14b7a1) [0x55f3027427a1]
#11 in python(_PyEval_EvalFrameDefault+0x2ec2) [0x55f302729cb2]
#12 in python(_PyFunction_Vectorcall+0x6f) [0x55f302736f8f]
#13 in python(_PyEval_EvalFrameDefault+0x2ec2) [0x55f302729cb2]
#14 in python(+0x14b641) [0x55f302742641]
#15 in python(_PyEval_EvalFrameDefault+0x4d0d) [0x55f30272bafd]
#16 in python(+0x1d8a82) [0x55f3027cfa82]
#17 in python(PyEval_EvalCode+0x87) [0x55f3027cf9c7]
#18 in python(+0x20b82c) [0x55f30280282c]
#19 in python(+0x206704) [0x55f3027fd704]
#20 in python(+0x1173ae) [0x55f30270e3ae]
#21 in python(_PyRun_InteractiveLoopObject+0xcc) [0x55f30270e544]
#22 in python(+0x96790) [0x55f30268d790]
#23 in python(PyRun_AnyFileExFlags+0x4b) [0x55f30270e6be]
#24 in python(+0x93931) [0x55f30268a931]
#25 in python(Py_BytesMain+0x39) [0x55f3027c2089]
#26 in /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f03ff629d90]
#27 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f03ff629e40]
#28 in python(+0x1caf81) [0x55f3027c1f81]

Aborted (core dumped)
divyegala commented 1 year ago

> It looks like the duplicated rows are causing the MST to disregard additional edges.

I agree with this analysis.

You can probably find a workaround by artificially introducing minor random noise in [0, delta), where delta is the narrowest edge difference, to every point before doing the reshape, so that the number of duplicates shrinks or vanishes. Maybe you can try that in your script, @beckernick.
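A minimal sketch of that jitter workaround, assuming the data is a NumPy float array (the noise scale 1e-6 here is an illustrative stand-in for the delta described above, not a value from this thread):

```python
import numpy as np

def add_jitter(X, scale=1e-6, seed=0):
    """Break exact duplicate rows by adding uniform noise drawn from [0, scale)."""
    rng = np.random.default_rng(seed)
    return X + rng.uniform(0.0, scale, size=X.shape)

# The tiled reproducer collapses to only 4 distinct rows; after jitter,
# every row is unique, so the MST has no zero-weight duplicate edges.
tiled = np.tile(np.arange(12), 1000).reshape(-1, 3)
jittered = add_jitter(tiled)
print(len(np.unique(tiled, axis=0)), len(np.unique(jittered, axis=0)))  # → 4 4000
```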

beckernick commented 1 year ago

Thanks for the suggestions. I agree that the error is clear, but it's uncaught and causes a segfault. Python user code should ideally not cause a segfault, even in rare scenarios like this (I know this is unlikely to occur naturally). Can we catch this error and propagate it up?

tarang-jain commented 1 year ago

I was able to reproduce both @beckernick's error and @cjnolet's error by tweaking the arange parameter. I agree with @cjnolet's analysis because of the repeated zero-weight edges in the KNN. Also, since the number of repeated points is greater than min_samples, core distances of all points would be zero. I tried adjusting min_samples to be just greater than REPS and the error does not occur. I can still dig deeper to find the exact piece of code that causes this error.

MartinKlefas commented 1 year ago

> @beckernick I think the error about convergence is generally not likely to happen in practice. It looks like the duplicated rows are causing the MST to disregard additional edges. If that case does end up becoming a showstopper on real datasets, we should definitely figure out a way around it. However, the error you are receiving explains what's going on: there's just not enough information to connect the graph because of the duplicated edges, and we need a connected graph in order to build the dendrogram.

Hi, I'm getting this with a "real" dataset.

It's a set of images I'm using for anomaly/outlier detection. The code runs on smaller numbers of similar images, up to around 350,000-400,000 samples, but if I go much beyond that I get this same crash. This only happens when I reduce the image vectors down to certain sizes through PCA, though, suggesting that I have inadvertently created multiple identical data points, as mentioned above.
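If it would help to confirm that hypothesis, one quick diagnostic (a hypothetical helper, not code from this thread) is to count exact duplicate rows in the PCA output before fitting:

```python
import numpy as np

def duplicate_stats(X):
    """Return (number of redundant rows, largest repeat count) for exact duplicates."""
    _, counts = np.unique(X, axis=0, return_counts=True)
    return int(X.shape[0] - counts.size), int(counts.max())

# Toy example: one point appears 3 times, so 2 rows are redundant.
X = np.array([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [3.0, 4.0]])
print(duplicate_stats(X))  # → (2, 3)
```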

I'm happy to provide code and input-data samples if that will help. What's the best way to get them to you? Enough of the dataset to reproduce the error is around 10 GB.

tarang-jain commented 1 year ago

@MartinKlefas Have you tried to increase min_samples? Adding non-zero edges to the KNN should lead to convergence. If you can compute the maximum number of repeated inputs in your dataset and set min_samples to be greater than that, it should work.
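That suggestion could be automated along these lines (a sketch; suggested_min_samples is a hypothetical helper, and the HDBSCAN call is commented out because it needs a GPU/RAPIDS environment):

```python
import numpy as np

def suggested_min_samples(X):
    """One more than the largest exact-repeat count in X, per the advice above."""
    _, counts = np.unique(X, axis=0, return_counts=True)
    return int(counts.max()) + 1

# A point repeated 12 times suggests min_samples of at least 13.
X = np.repeat(np.array([[0.0, 1.0], [2.0, 3.0]]), [12, 4], axis=0)
print(suggested_min_samples(X))  # → 13
# from cuml.cluster import HDBSCAN
# HDBSCAN(min_samples=suggested_min_samples(X)).fit(X)
```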

MartinKlefas commented 1 year ago

> @MartinKlefas Have you tried to increase min_samples? Adding non-zero edges to the KNN should lead to convergence. If you can compute the maximum number of repeated inputs in your dataset and set min_samples to be greater than that, it should work.

Thanks. I didn't do the full computation, but I just multiplied min_samples by 10 and the clustering ran again.

NitinVishalKulkarni commented 10 months ago

I am experiencing this as well. My dataset is generated from a Reinforcement Learning environment (Atari Pong).

terramars commented 1 week ago

I just ran into what I think is related to this bug. The dataset is spatial and comes from the Vesuvius Challenge. Unfortunately, it looks like there are no duplicates, so I'm not sure how to get closer to identifying the problem.

terramars commented 1 week ago

I managed to pull the segfaulting data:

x = np.array([
    [168.5, 174.75, 243.], [172., 125., 249.5], [172., 172., 245.5], [172., 172., 245.75],
    [172., 172., 246.], [172., 172., 246.25], [172., 172., 246.5], [172., 172., 246.75],
    [172., 172., 247.], [172., 174., 246.25], [172., 174., 246.5], [172., 174., 246.75],
    [172., 174., 247.], [172.25, 125.5, 249.75], [172.25, 172.25, 246.25], [172.25, 172.5, 247.],
    [172.25, 172.5, 247.25], [172.25, 173.75, 200.5], [172.25, 173.75, 200.75], [172.25, 174., 247.],
    [172.5, 172., 245.], [172.5, 172.25, 246.5], [172.5, 172.25, 246.75], [172.5, 172.25, 247.],
    [172.5, 172.5, 245.5], [172.5, 174., 246.25], [172.75, 172., 245.], [172.75, 172.25, 246.75],
    [172.75, 172.25, 247.], [172.75, 172.5, 245.25], [172.75, 174., 245.75], [173., 172., 245.25],
    [173., 172.25, 246.75], [173., 172.25, 247.], [173., 174., 245.5], [173.25, 125., 249.25],
    [173.25, 125.25, 249.75], [173.25, 170.5, 245.75], [173.25, 170.5, 247.25], [173.25, 172., 245.5],
    [173.25, 172., 246.5], [173.25, 172., 246.75], [173.25, 172., 247.], [173.25, 172.25, 246.75],
    [173.25, 172.25, 247.], [173.25, 174.25, 243.5], [173.5, 125., 249.5], [173.5, 125.25, 249.75],
    [173.5, 125.5, 249.5], [173.5, 125.5, 249.75], [173.5, 170.5, 246.75], [173.5, 170.5, 247.],
    [173.5, 172., 245.25], [173.5, 172., 246.5], [173.5, 172., 246.75], [173.5, 172., 247.],
    [173.5, 172.25, 246.75], [173.5, 174., 244.25], [173.5, 174.5, 242.25], [173.75, 125., 249.75],
    [173.75, 125.5, 249.5], [173.75, 125.5, 249.75], [173.75, 171.75, 239.5], [173.75, 171.75, 239.75],
    [173.75, 172., 244.75], [173.75, 172., 245.], [173.75, 172., 245.25], [173.75, 172., 246.5],
    [173.75, 172., 246.75], [173.75, 174., 241.5], [173.75, 174., 244.], [173.75, 174.25, 242.25],
    [173.75, 174.5, 241.75], [174., 125.5, 249.5], [174., 125.5, 249.75], [174., 125.75, 249.25],
    [174., 125.75, 249.5], [174., 125.75, 249.75], [174., 169.75, 249.], [174., 170.25, 247.5],
    [174., 170.5, 245.5], [174., 171.75, 239.], [174., 171.75, 239.25], [174., 171.75, 239.5],
    [174., 172., 245.], [174., 172., 246.25], [174., 172., 246.5], [174., 172., 246.75],
    [174., 173.75, 240.75], [174., 174., 241.5], [174.25, 125.5, 249.5], [174.25, 125.5, 249.75],
    [174.25, 125.75, 248.75], [174.25, 125.75, 249.], [174.25, 125.75, 249.25], [174.25, 125.75, 249.5],
    [174.25, 125.75, 249.75], [174.25, 172., 245.25], [174.25, 174., 241.75], [174.25, 174., 243.75],
    [174.5, 125.5, 249.5], [174.5, 125.5, 249.75], [174.5, 125.75, 201.75], [174.5, 125.75, 202.75],
    [174.5, 125.75, 248.5], [174.5, 125.75, 248.75], [174.5, 125.75, 249.], [174.5, 125.75, 249.25],
    [174.5, 125.75, 249.5], [174.5, 125.75, 249.75], [174.5, 174., 242.25], [174.5, 174., 243.5],
    [174.75, 125.5, 249.5], [174.75, 125.5, 249.75], [174.75, 125.75, 201.5], [174.75, 125.75, 203.],
    [174.75, 125.75, 248.5], [174.75, 125.75, 248.75], [174.75, 125.75, 249.], [174.75, 125.75, 249.25],
    [174.75, 125.75, 249.5], [174.75, 125.75, 249.75], [174.75, 170.25, 247.], [174.75, 171.5, 241.5],
    [174.75, 174., 241.5], [174.75, 174., 242.5], [174.75, 174., 242.75], [174.75, 174., 243.],
    [174.75, 174.75, 234.5]])

hdb = HDBSCAN(min_samples=10, min_cluster_size=10, allow_single_cluster=True)
db = hdb.fit(x)

terramars commented 1 week ago

This result is not sensitive to the min_samples or cluster-size settings: all of them segfault. Dropping the last data element avoids the segfault, but the fit then hangs indefinitely.

divyegala commented 5 days ago

@terramars while there are no explicit duplicates, it looks to me like every point is quite close in distance to the previous one. What precision are you running with? Can you try running with np.float64?
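A minimal way to try that suggestion (hypothetical; the thread doesn't confirm whether float64 resolves the crash, and the HDBSCAN call is commented out because it needs a GPU/RAPIDS environment):

```python
import numpy as np
# from cuml.cluster import HDBSCAN  # requires a GPU/RAPIDS environment

# A couple of rows from the segfaulting data above, stored as float32.
x32 = np.array([[172.0, 172.0, 245.75],
                [172.0, 172.0, 246.0]], dtype=np.float32)

# Widen to float64 before fitting; closely spaced points keep more of
# their tiny separations at double precision.
x64 = x32.astype(np.float64)
print(x64.dtype)  # → float64
# HDBSCAN(min_samples=10, min_cluster_size=10, allow_single_cluster=True).fit(x64)
```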