rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

multi-gpu dbscan Segmentation fault #5961

Open Cocoaxx opened 1 month ago

Cocoaxx commented 1 month ago

Describe the bug

When I try to use multi-GPU DBSCAN, I get a crash: (Segmentation fault: invalid permissions for mapped object at address 0x7f0c8e0007c0)

[screenshot: error trace]

Steps/Code to reproduce bug

[screenshot: reproduction code]

Environment details (please complete the following information):

dantegd commented 1 month ago

Thanks for the issue @Cocoaxx. The permissions error makes me believe this might have to do with UCX on the system; the first warning in the trace is suspicious. Maybe someone like @pentschev can tell whether I'm looking in the right place to triage this issue.

pentschev commented 1 month ago

The warning saying transports 'cuda_copy', ... are not available means UCX wasn't compiled with CUDA support. I also notice you have ucx-py-cu11 installed but no libucx-cu11, which suggests you probably installed UCX and UCX-Py from source. If that's the case, I would suggest relying on libucx-cu11 instead of a system UCX install. If that's not possible, you would have to recompile UCX with --with-cuda=$CUDA_HOME, where CUDA_HOME points to the CUDA system installation, generally /usr/local/cuda but possibly elsewhere on your system.
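
A rough sketch of what that rebuild could look like, assuming you reconfigure from a build directory inside the UCX source tree (the install prefix here is a placeholder for your system):

# From a build directory inside the UCX source tree; CUDA_HOME points at the CUDA toolkit install.
../contrib/configure-release --prefix=<ucx-install-prefix> --with-cuda=$CUDA_HOME --enable-mt
make -j && make install
# With CUDA support built, cuda_copy/cuda_ipc should now appear among the available transports:
ucx_info -d | grep -i cuda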

Cocoaxx commented 1 month ago

Thank you for your quick reply, I will give it a try. By the way, I have to use Python 3.8 and RAPIDS 23.04, but I find that ucx-py needs Python >= 3.9? My Dockerfile install instruction looks like this:

RUN source ~/.bashrc \
    && conda deactivate && conda activate env-3.8.8 \
    && pip install protobuf==3.20.1 \
    && pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com \
    && pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com \
    && pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com \
    && pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118 \
    && pip install MarkupSafe==2.0.1 \
    && pip install scikit-learn \
    && pip install transformers==4.38.2 \
    && pip install sentence-transformers==2.2.2 \
    && wget -P /tmp $GENERIC_REPO_URL/cpu/clean-layer.sh \
    && sh /tmp/clean-layer.sh \
    && cd /data/ \
    && ln -s miniconda3/envs/env-3.8.8 anaconda3

Cocoaxx commented 1 month ago

Now I have upgraded Python to 3.9 and still encounter this problem. I tried to install libucx_cu11 from a wheel file, but I got ERROR: libucx_cu11-1.16.0.post1-py3-none-manylinux_2_28_x86_64.whl is not a supported wheel on this platform. Then I tried to install UCX from source following the docs at https://ucx-py.readthedocs.io/en/latest/install.html#source. I built UCX 1.17.0 from source with the configure command

../contrib/configure-release --prefix=/data/miniconda3/ --with-cuda=/usr/local/cuda-11.8 --enable-mt --without-go --without-java

and the build log looks like this:

[screenshot: UCX build log]

Then I tried to reinstall ucx-py, but I got this error:

[screenshot: ucx-py install error]

Have I overlooked any important steps? Please give me some suggestions.

pentschev commented 1 month ago

It seems like you're using conda; in that case, why are you attempting to install RAPIDS (both cuML and UCX-Py) from PyPI? A much easier choice is to install all RAPIDS packages with conda; you can have a look at the RAPIDS install selector tool for instructions.

This information is irrelevant if you use conda as I suggested above, but just for completeness: you specified --prefix=/data/miniconda3/, which is the path where conda itself is installed, but you would have to install into your conda environment, i.e. --prefix=$CONDA_PREFIX, which is also what the UCX-Py documentation says. Finally, the latest picture suggests you're trying to install ucx-py=0.30, which is very old; for RAPIDS 24.04 the matching version would be ucx-py=0.37, for RAPIDS 24.06 the matching version is ucx-py=0.38, and so on.
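
Concretely, the configure command from your earlier comment would only need the prefix changed; roughly like this (a sketch, run with the target conda environment activated):

# Same flags as before, but install into the active conda environment rather than the conda base path.
../contrib/configure-release --prefix=$CONDA_PREFIX --with-cuda=/usr/local/cuda-11.8 --enable-mt --without-go --without-java
make -j && make install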

Cocoaxx commented 1 month ago

Our images are all tlinux, which is similar to CentOS, not Ubuntu, and RAPIDS 24.04 doesn't support it. We tried installing cuML/cuDF 23.04, which works well on a single GPU, but we get this error when using multiple GPUs. Is there any way to solve this problem?

pentschev commented 1 month ago

It's true that we don't provide system packages and Docker images beyond Rocky Linux and Ubuntu. However, with a conda install (which you do have, according to the conda deactivate && conda activate env-3.8.8 line you posted above) you should be able to install all RAPIDS packages, including UCX/UCX-Py. With that, you can first install all the RAPIDS dependencies in your conda environment and then use pip to install anything else you need that's potentially not available from conda-forge. What I'm suggesting is something like this:

conda create -n env-3.8.8 -c rapidsai -c conda-forge -c nvidia cudf=24.06 cuml=24.06 cugraph=24.06 python=3.9 cuda-version=11.8
conda activate env-3.8.8
conda install ...
pip install ...

The above will be the lowest barrier for you, and cuml/cugraph will both automatically pull in UCX/UCX-Py as dependencies. I also suggest using RAPIDS 24.06, as 24.04 is already the old stable release and we cannot provide support for it.
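
Once that environment is created, a quick generic check (not from the original thread) to confirm the UCX- and NCCL-related packages were pulled in:

# List UCX/NCCL-related packages installed in the active conda environment.
conda list | grep -i -E "ucx|nccl"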

If you still need to build things from source for a different reason, then the next step for you would be to check what I said previously:

This information is irrelevant if you use conda as I suggested above, but just for completeness: you specified --prefix=/data/miniconda3/, which is the path where conda itself is installed, but you would have to install into your conda environment, i.e. --prefix=$CONDA_PREFIX, which is also what the UCX-Py documentation says. Finally, the latest picture suggests you're trying to install ucx-py=0.30, which is very old; for RAPIDS 24.04 the matching version would be ucx-py=0.37, for RAPIDS 24.06 the matching version is ucx-py=0.38, and so on.

pentschev commented 1 month ago

One more piece of information that I've now confirmed with others more experienced than me: RAPIDS 24.04 requires glibc>=2.17 and RAPIDS 24.06+ requires glibc>=2.28, see https://github.com/rapidsai/build-planning/issues/23 for more information. Therefore, for the conda install I proposed above to work, you must ensure your system provides at least the minimum glibc version RAPIDS requires.
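
A generic way to check which glibc version the system provides (not something from the report itself):

# Either command prints the system glibc version on most Linux distributions.
ldd --version | head -n 1
getconf GNU_LIBC_VERSION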

Cocoaxx commented 1 month ago

Thank you for your suggestion. Now I can install RAPIDS on my machine and use the multi-GPU DBSCAN algorithm. But when I use it on the cloud development machine, I encounter another error.

[screenshot: error output]

I suspect it's a problem with the GPU driver version. The CUDA version on the development machine is 11.8 and the driver version is 450.156.00, but RAPIDS needs driver 520.61.05 or newer, am I correct?

[screenshot]

pentschev commented 1 month ago

It's hard to say for sure, but CUDA 11.0 hasn't been supported since 2022; RAPIDS supports a minimum of CUDA 11.2, which requires driver 470.42.01 at minimum. To take advantage of CUDA 11.8 features you'll indeed need 520.61.05, although it will run on 470.42.01 thanks to CUDA Enhanced Compatibility, just with the newer features disabled.
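
For reference, a generic way to check the installed driver and the highest CUDA version that driver supports (not part of the original exchange):

# The nvidia-smi header shows "Driver Version" and the maximum "CUDA Version" the driver supports.
nvidia-smi
# Or query only the driver version:
nvidia-smi --query-gpu=driver_version --format=csv,noheader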

Cocoaxx commented 1 month ago

Thank you for your reply. But when I run RAPIDS on 2x A10 GPUs with CUDA 11.8 and driver 470.141.03, I get an error like this:

[1721380813.194737] [VM-192-150-centos:2662 :0] parser.c:2036 UCX WARN unused environment variables: UCX_WARN_UNUSED_ENV_VARS (maybe: UCX_WARN_UNUSED_ENV_VARS?); UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1721380813.194737] [VM-192-150-centos:2662 :0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
Dask CUDA Cluster created and client connected.
Sample data generated.
DBSCAN model defined.
VM-192-150-centos:2662:2857 [32750] NCCL INFO Bootstrap : Using eth0:9.130.192.150<0>
VM-192-150-centos:2662:2857 [32751] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
VM-192-150-centos:2662:2857 [32750] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
VM-192-150-centos:2662:2857 [0] NCCL INFO NET/Plugin: Using internal network plugin.

VM-192-150-centos:2871:2871 [32523] misc/cudawrap.cc:182 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'

VM-192-150-centos:2871:2871 [1868963956] init.cc:1832 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'

VM-192-150-centos:2866:2866 [32677] misc/cudawrap.cc:182 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'

VM-192-150-centos:2866:2866 [1868963956] init.cc:1832 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'

2024-07-19 17:20:24,037 - distributed.worker - WARNING - Run Failed
Function: _func_init_all
args: (b'2\x92#\x0f\xd4+K\\x9bj\x9cy\xff\xd7eD', b"\xbc'|\xd5\xff\x02b\xa4\x02\x00\xa6!\t\x82\xc0\x96\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\x1b\x147\xef\x7f\x00\x00\x80\xcdv9\xef\x7f\x00\x00\xe0\xe6o\x82\xef\x7f\x00\x00\xf8\xe6o\x82\xef\x7f\x00\x00\x9f\xf9N\x00\x00\x00\x00\x00\x96\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00 (v9\xef\x7f\x00\x00pIv9\xef\x7f\x00", True, {'ucx://127.0.0.1:35807': {'rank': 1, 'port': 41683}, 'ucx://127.0.0.1:55299': {'rank': 0, 'port': 33303}}, False, 0)
kwargs: {'dask_worker': <Worker 'ucx://127.0.0.1:55299', name: 0, status: running, stored: 1, running: 0/1, ready: 0, comm: 0, waiting: 0>}
Traceback (most recent call last):
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/distributed/worker.py", line 3185, in run
    result = await function(*args, **kwargs)
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/raft_dask/common/comms.py", line 450, in _func_init_all
    _func_init_nccl(sessionId, uniqueId, dask_worker=dask_worker)
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/raft_dask/common/comms.py", line 515, in _func_init_nccl
    n.init(nWorkers, uniqueId, wid)
  File "nccl.pyx", line 151, in raft_dask.common.nccl.nccl.init
RuntimeError: NCCL_ERROR: b'unhandled cuda error (run with NCCL_DEBUG=INFO for details)'

2024-07-19 17:20:24,036 - distributed.worker - WARNING - Run Failed
Function: _func_init_all
args: (b'2\x92#\x0f\xd4+K\\x9bj\x9cy\xff\xd7eD', b"\xbc'|\xd5\xff\x02b\xa4\x02\x00\xa6!\t\x82\xc0\x96\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\x1b\x147\xef\x7f\x00\x00\x80\xcdv9\xef\x7f\x00\x00\xe0\xe6o\x82\xef\x7f\x00\x00\xf8\xe6o\x82\xef\x7f\x00\x00\x9f\xf9N\x00\x00\x00\x00\x00\x96\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00 (v9\xef\x7f\x00\x00pIv9\xef\x7f\x00", True, {'ucx://127.0.0.1:35807': {'rank': 1, 'port': 41683}, 'ucx://127.0.0.1:55299': {'rank': 0, 'port': 33303}}, False, 0)
kwargs: {'dask_worker': <Worker 'ucx://127.0.0.1:35807', name: 1, status: running, stored: 1, running: 0/1, ready: 0, comm: 0, waiting: 0>}
Traceback (most recent call last):
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/distributed/worker.py", line 3185, in run
    result = await function(*args, **kwargs)
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/raft_dask/common/comms.py", line 450, in _func_init_all
    _func_init_nccl(sessionId, uniqueId, dask_worker=dask_worker)
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/raft_dask/common/comms.py", line 515, in _func_init_nccl
    n.init(nWorkers, uniqueId, wid)
  File "nccl.pyx", line 151, in raft_dask.common.nccl.nccl.init
RuntimeError: NCCL_ERROR: b'unhandled cuda error (run with NCCL_DEBUG=INFO for details)'

Traceback (most recent call last):
  File "/workspace/user_code/nickname_seq_cluster/test.py", line 47, in <module>
    labels = dbscan.fit_predict(ddf)
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/cuml/dask/cluster/dbscan.py", line 160, in fit_predict
    self.fit(X, out_dtype)
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/cuml/dask/cluster/dbscan.py", line 119, in fit
    comms.init()
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/raft_dask/common/comms.py", line 200, in init
    self.client.run(
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/distributed/client.py", line 2991, in run
    return self.sync(
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/distributed/utils.py", line 358, in sync
    return sync(
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/distributed/utils.py", line 434, in sync
    raise error
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/distributed/utils.py", line 408, in f
    result = yield future
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/distributed/client.py", line 2896, in _run
    raise exc
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/raft_dask/common/comms.py", line 450, in _func_init_all
    _func_init_nccl(sessionId, uniqueId, dask_worker=dask_worker)
  File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/raft_dask/common/comms.py", line 515, in _func_init_nccl
    n.init(nWorkers, uniqueId, wid)
  File "nccl.pyx", line 151, in raft_dask.common.nccl.nccl.init
RuntimeError: NCCL_ERROR: b'unhandled cuda error (run with NCCL_DEBUG=INFO for details)'

2024-07-19 17:20:24,207 - distributed.scheduler - ERROR - Removing worker 'ucx://127.0.0.1:55299' caused the cluster to lose scattered data, which can't be recovered: {'DataFrame-a9260fb655755d3cde1fc36cae8236b9'} (stimulus_id='worker-send-comm-fail-1721380824.2074068')

pentschev commented 1 month ago

The error seems to stem from:

VM-192-150-centos:2871:2871 [32523] misc/cudawrap.cc:182 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'

@cjnolet @viclafargue would you be able to help here with the NCCL errors in RAFT? What is the minimum required driver version for it? The user is running CUDA 11.8 on driver 470.141.03 (a CUDA 11.2 driver); would a driver upgrade be required, or perhaps a downgrade to a CUDA 11.2 build for their system?
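
As the NCCL error message itself suggests, rerunning with NCCL debug logging enabled should narrow this down; a generic way to do that with the script from the traceback above:

# NCCL_DEBUG=INFO makes NCCL print detailed initialization and failure information from every worker.
NCCL_DEBUG=INFO python test.py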