Closed dxm447 closed 2 years ago
RMM 22.06.01 should be picked up by default from the rapidsai conda channel on supported systems when using the rapids metapackage (which pins to 11.7.0). Does this happen if you set rapids=22.06?
Could you provide additional information about your system and share the output of your mamba install command?
This is my output:
```
Updating specs:
  compilers

  Package  Version  Build  Channel  Size
  ──────────────────────────────────────
  Install:
  ──────────────────────────────────────

Summary:
  Install: 431 packages
  Total download: 2GB
  ──────────────────────────────────────
Confirm changes: [Y/n]
```
@dxm447 could you confirm that this problem does not persist if you attempt to recreate the environment now? It is not clear from the above output which package versions were initially installed, but as @beckernick pointed out, any of our 22.06.00 packages would have pulled in the incorrect cuda-python version. rmm and cudf were updated two weeks ago, but cugraph was only updated the day before this issue was created (and I do see it in your environment), so it's possible that you were still on an older version of cugraph that was missing the pinning.
tl;dr:
mamba install -y -c conda-forge -c rapidsai -c nvidia "cuml=22.04[build=cuda11_py39*]" "cudatoolkit=11.6"
picks up cuda-python=11.7.1, and then import cuml fails with:
…
File "/opt/conda/lib/python3.9/site-packages/rmm/__init__.py", line 16, in <module>
from rmm import mr
File "/opt/conda/lib/python3.9/site-packages/rmm/mr.py", line 14, in <module>
from rmm._lib.memory_resource import (
File "/opt/conda/lib/python3.9/site-packages/rmm/_lib/__init__.py", line 15, in <module>
from .device_buffer import DeviceBuffer
File "device_buffer.pyx", line 1, in init rmm._lib.device_buffer
TypeError: C function cuda.ccudart.cudaStreamSynchronize has wrong signature (expected __pyx_t_4cuda_7ccudart_cudaError_t (__pyx_t_4cuda_7ccudart_cudaStream_t), got cudaError_t (cudaStream_t))
It seems like adding cuda-python=11.6.1[build=py39*] as an additional pin is enough to work around the issue (see my Dockerfile below).
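The incompatibility can also be detected up front from the installed cuda-python version alone. Below is a hypothetical pre-flight helper (not part of RMM or cuML; the 11.7.1 cutoff is taken from this thread's diagnosis):

```python
# Hypothetical pre-flight check: cuda-python 11.7.1 changed cudart type
# signatures, breaking RAPIDS builds compiled against 11.7.0 or older.
def parse_version(v: str) -> tuple:
    """Turn a version string like '11.7.1' into (11, 7, 1) for comparison."""
    return tuple(int(part) for part in v.split("."))

def needs_cuda_python_pin(installed: str) -> bool:
    """True if this cuda-python version would trigger the signature error."""
    return parse_version(installed) >= parse_version("11.7.1")

print(needs_cuda_python_pin("11.6.1"))  # False: the workaround pin above
print(needs_cuda_python_pin("11.7.0"))  # False: last compatible release
print(needs_cuda_python_pin("11.7.1"))  # True: incompatible signatures
```

Tuple comparison avoids the classic string-comparison trap where "11.10" sorts before "11.7".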
Here is a Dockerfile that reproduces this:
FROM nvidia/cuda:11.6.1-base-ubuntu20.04
ENV PYTHON_VERSION_SHORT=39
ENV PYTHON_VERSION_FULL=3.9.13
ENV CONDA_VERSION_FULL=4.12.0
ENV MAMBA_VERSION_FULL=0.24.0
ENV CUDA_VERSION_MINOR=11.6
ENV RAPIDS_VERSION=22.04
ENV PATH=${PATH}:/opt/conda/bin
RUN apt-get update \
&& apt-get install -y wget \
&& wget -q "https://repo.anaconda.com/miniconda/Miniconda3-py${PYTHON_VERSION_SHORT}_${CONDA_VERSION_FULL}-Linux-x86_64.sh" -O ~/miniconda.sh \
&& /bin/bash ~/miniconda.sh -b -p /opt/conda \
&& apt-get clean \
&& conda install -y -c conda-forge "conda=${CONDA_VERSION_FULL}" "python=${PYTHON_VERSION_FULL}" "mamba=${MAMBA_VERSION_FULL}" pip \
&& mamba install -y -c conda-forge -c rapidsai -c nvidia "cuml=${RAPIDS_VERSION}[build=cuda11_py${PYTHON_VERSION_SHORT}*]" "cudatoolkit=${CUDA_VERSION_MINOR}" \
&& conda clean -afy
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/cuda-${CUDA_VERSION_MINOR}/compat"
ENTRYPOINT ["python"]
CMD ["-c", "import cuml"] # ❌ fails: TypeError: C function cuda.ccudart.cudaStreamSynchronize has wrong signature
Build:
docker build -t import-cuml .
Run:
docker run --rm import-cuml
I also pushed it to Docker Hub as runsascoded/import-cuml:
docker pull runsascoded/import-cuml # 2.55G compressed, sorry; I tried all the squash+clean tricks I know of
docker run --rm runsascoded/import-cuml
# Traceback (most recent call last):
# File "<string>", line 1, in <module>
# File "/opt/conda/lib/python3.9/site-packages/cuml/__init__.py", line 17, in <module>
# from cuml.common.base import Base
# File "/opt/conda/lib/python3.9/site-packages/cuml/common/__init__.py", line 17, in <module>
# from cuml.common.array import CumlArray
# File "/opt/conda/lib/python3.9/site-packages/cuml/common/array.py", line 25, in <module>
# from cudf import DataFrame
# File "/opt/conda/lib/python3.9/site-packages/cudf/__init__.py", line 5, in <module>
# validate_setup()
# File "/opt/conda/lib/python3.9/site-packages/cudf/utils/gpu_utils.py", line 20, in validate_setup
# from rmm._cuda.gpu import (
# File "/opt/conda/lib/python3.9/site-packages/rmm/__init__.py", line 16, in <module>
# from rmm import mr
# File "/opt/conda/lib/python3.9/site-packages/rmm/mr.py", line 14, in <module>
# from rmm._lib.memory_resource import (
# File "/opt/conda/lib/python3.9/site-packages/rmm/_lib/__init__.py", line 15, in <module>
# from .device_buffer import DeviceBuffer
# File "device_buffer.pyx", line 1, in init rmm._lib.device_buffer
# TypeError: C function cuda.ccudart.cudaStreamSynchronize has wrong signature (expected __pyx_t_4cuda_7ccudart_cudaError_t (__pyx_t_4cuda_7ccudart_cudaStream_t), got cudaError_t (cudaStream_t))
It seems that installing 11.6 versions of some RAPIDS libraries picks up 11.7 versions that have C functions with incompatible type signatures.
In my original project, I'm installing a few RAPIDS/CUDA libraries directly (cudf, cugraph, cuml, cudatoolkit) to save time and space vs. a full rapids install (I think cuSpatial in particular was bringing in a large group of geo-related dependencies, with associated conda/mamba solve issues), and I ran into this. I pin RAPIDS 22.04.x or 22.06.x and CUDA 11.6, and at some point in the last month or so (probably when 11.7 releases started happening), the build broke because of this.
Here's an easy way to see the 11.6/11.7 mix that I end up with in the Dockerfile above (where I tried to pin 11.6):
docker run --rm runsascoded/import-cuml /opt/conda/bin/mamba list cuda
# cuda-python 11.7.1 py39h1eff087_0 conda-forge
# cudatoolkit 11.6.0 hecad31d_10 conda-forge
# dask-cuda 22.4.0 pyhd8ed1ab_1 conda-forge
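That mixed state can also be flagged programmatically by parsing the listing. A small sketch, using the sample lines from the container above; the restriction to packages with an 11.x version (i.e. versioned by CUDA release rather than by RAPIDS release, like dask-cuda) is my own heuristic, not anything mamba provides:

```python
# Sketch: detect CUDA minor-version skew in `mamba list cuda`-style output.
# Sample lines copied from the runsascoded/import-cuml container above.
sample = """\
cuda-python 11.7.1 py39h1eff087_0 conda-forge
cudatoolkit 11.6.0 hecad31d_10 conda-forge
dask-cuda 22.4.0 pyhd8ed1ab_1 conda-forge
"""

def cuda_minor_versions(listing: str) -> dict:
    """Map package name -> 'major.minor' for CUDA-versioned (11.x) packages."""
    versions = {}
    for line in listing.splitlines():
        name, version, *_ = line.split()
        major, minor, *_ = version.split(".")
        if major == "11":  # skip packages versioned by RAPIDS release (22.x)
            versions[name] = f"{major}.{minor}"
    return versions

versions = cuda_minor_versions(sample)
print(versions)                          # {'cuda-python': '11.7', 'cudatoolkit': '11.6'}
print(len(set(versions.values())) > 1)   # True: a mixed 11.6/11.7 install
```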
It seems like adding cuda-python=11.6.1[build=py39*] as an additional pin is enough to work around the issue (see my Dockerfile above).
Got it, that is as expected then, thanks! The issue is specifically related to cuda-python, as explained in this notice, and the two solutions are either pinning cuda-python<11.7.1 (as you found) or updating to RAPIDS 22.06.01. Our patch-release conda packages for 22.06 handle the necessary pinning for you.
Describe the bug
I have rmm installed through conda with the following command:
mamba install -c rapidsai -c nvidia cuda rapids compilers
When running import rmm, I run into the following error message in my Jupyter notebook. I solved this by downgrading cuda-python from 11.7.1 to 11.7.0, as referred to in issue 4798 for cuml.