rapidsai / rmm

RAPIDS Memory Manager
https://docs.rapids.ai/api/rmm/stable/
Apache License 2.0

[BUG] RMM fails to import when using cuda-python above 11.7.0 #1068

Closed. dxm447 closed this issue 2 years ago.

dxm447 commented 2 years ago

Describe the bug

import rmm fails with the error below.

I have rmm installed through conda with the following command:

mamba install -c rapidsai -c nvidia cuda rapids compilers

When trying to import rmm, I run into the following error message in my Jupyter notebook.

C function cuda.ccudart.cudaStreamSynchronize has wrong signature (expected __pyx_t_4cuda_7ccudart_cudaError_t (__pyx_t_4cuda_7ccudart_cudaStream_t), got cudaError_t (cudaStream_t))

I solved this by downgrading cuda-python from 11.7.1 to 11.7.0, as referenced in issue 4798 for cuml.
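
For reference, the downgrade amounts to something like the following (the channel here is an assumption; any channel carrying the 11.7.0 build works):

mamba install -c conda-forge "cuda-python=11.7.0"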

beckernick commented 2 years ago

RMM 22.06.01 should be picked up by default from the rapidsai conda channel on supported systems when using the rapids metapackage (which pins to 11.7.0). Does this happen if you set rapids=22.06?

Could you provide additional information about your system and share the output of your mamba install command?
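
For illustration, a rapids=22.06 pinned install along those lines might look like the following (the python and cudatoolkit pins here are illustrative choices, not part of the suggestion above):

mamba install -c rapidsai -c nvidia -c conda-forge rapids=22.06 python=3.9 cudatoolkit=11.5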

dxm447 commented 2 years ago

This is my output:

Updating specs:

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Confirm changes: [Y/n]

vyasr commented 2 years ago

@dxm447 could you confirm that this problem does not persist if you attempt to recreate the environment now? It is not clear from the above output which package versions were initially installed, but as @beckernick pointed out, any of our 22.06.00 packages would have pulled in the incorrect cuda-python version. rmm and cudf were updated two weeks ago, but cugraph was only updated the day before this issue was created (and I do see it in your environment), so it's possible that you were still on an older version of cugraph that was missing the pinning.

ryan-williams commented 2 years ago

tldr:

mamba install -y -c conda-forge -c rapidsai -c nvidia "cuml=22.04[build=cuda11_py39*]" "cudatoolkit=11.6"

picks up cuda-python=11.7.1, and then import cuml fails with:

…
  File "/opt/conda/lib/python3.9/site-packages/rmm/__init__.py", line 16, in <module>
    from rmm import mr
  File "/opt/conda/lib/python3.9/site-packages/rmm/mr.py", line 14, in <module>
    from rmm._lib.memory_resource import (
  File "/opt/conda/lib/python3.9/site-packages/rmm/_lib/__init__.py", line 15, in <module>
    from .device_buffer import DeviceBuffer
  File "device_buffer.pyx", line 1, in init rmm._lib.device_buffer
TypeError: C function cuda.ccudart.cudaStreamSynchronize has wrong signature (expected __pyx_t_4cuda_7ccudart_cudaError_t (__pyx_t_4cuda_7ccudart_cudaStream_t), got cudaError_t (cudaStream_t))

It seems like adding cuda-python=11.6.1[build=py39*] as an additional pin is enough to work around the issue (in my Dockerfile below).
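
Concretely, the tldr command with that pin added would look something like this (a sketch of the workaround described above, not a separately verified command):

mamba install -y -c conda-forge -c rapidsai -c nvidia "cuml=22.04[build=cuda11_py39*]" "cudatoolkit=11.6" "cuda-python=11.6.1[build=py39*]"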


Here is a Dockerfile that reproduces this:

FROM nvidia/cuda:11.6.1-base-ubuntu20.04

ENV PYTHON_VERSION_SHORT=39
ENV PYTHON_VERSION_FULL=3.9.13
ENV CONDA_VERSION_FULL=4.12.0
ENV MAMBA_VERSION_FULL=0.24.0
ENV CUDA_VERSION_MINOR=11.6
ENV RAPIDS_VERSION=22.04

ENV PATH=${PATH}:/opt/conda/bin

RUN apt-get update \
 && apt-get install -y wget \
 && wget -q "https://repo.anaconda.com/miniconda/Miniconda3-py${PYTHON_VERSION_SHORT}_${CONDA_VERSION_FULL}-Linux-x86_64.sh" -O ~/miniconda.sh \
 && /bin/bash ~/miniconda.sh -b -p /opt/conda \
 && apt-get clean \
 && conda install -y -c conda-forge "conda=${CONDA_VERSION_FULL}" "python=${PYTHON_VERSION_FULL}" "mamba=${MAMBA_VERSION_FULL}" pip \
 && mamba install -y -c conda-forge -c rapidsai -c nvidia "cuml=${RAPIDS_VERSION}[build=cuda11_py${PYTHON_VERSION_SHORT}*]" "cudatoolkit=${CUDA_VERSION_MINOR}" \
 && conda clean -afy

ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/cuda-${CUDA_VERSION_MINOR}/compat"

ENTRYPOINT ["python"]
CMD ["-c", "import cuml"]  # ❌ fails: TypeError: C function cuda.ccudart.cudaStreamSynchronize has wrong signature

Build:

docker build -t import-cuml .

Run:

docker run --rm import-cuml

I also pushed it to Docker Hub as runsascoded/import-cuml:

docker pull runsascoded/import-cuml  # 2.55G compressed; sorry, I tried the squash+clean tricks I know of

docker run --rm runsascoded/import-cuml
# Traceback (most recent call last):
#   File "<string>", line 1, in <module>
#   File "/opt/conda/lib/python3.9/site-packages/cuml/__init__.py", line 17, in <module>
#     from cuml.common.base import Base
#   File "/opt/conda/lib/python3.9/site-packages/cuml/common/__init__.py", line 17, in <module>
#     from cuml.common.array import CumlArray
#   File "/opt/conda/lib/python3.9/site-packages/cuml/common/array.py", line 25, in <module>
#     from cudf import DataFrame
#   File "/opt/conda/lib/python3.9/site-packages/cudf/__init__.py", line 5, in <module>
#     validate_setup()
#   File "/opt/conda/lib/python3.9/site-packages/cudf/utils/gpu_utils.py", line 20, in validate_setup
#     from rmm._cuda.gpu import (
#   File "/opt/conda/lib/python3.9/site-packages/rmm/__init__.py", line 16, in <module>
#     from rmm import mr
#   File "/opt/conda/lib/python3.9/site-packages/rmm/mr.py", line 14, in <module>
#     from rmm._lib.memory_resource import (
#   File "/opt/conda/lib/python3.9/site-packages/rmm/_lib/__init__.py", line 15, in <module>
#     from .device_buffer import DeviceBuffer
#   File "device_buffer.pyx", line 1, in init rmm._lib.device_buffer
# TypeError: C function cuda.ccudart.cudaStreamSynchronize has wrong signature (expected __pyx_t_4cuda_7ccudart_cudaError_t (__pyx_t_4cuda_7ccudart_cudaStream_t), got cudaError_t (cudaStream_t))

It seems that installing 11.6-pinned versions of some RAPIDS libraries still picks up an 11.7 version of cuda-python whose C functions have incompatible type signatures.

In my original project, I'm installing a few RAPIDS/CUDA libraries directly (cudf, cugraph, cuml, cudatoolkit) to save time and space compared with a full rapids install (I think cuSpatial in particular was bringing in a large group of geo-related dependencies, along with the associated conda/mamba "solve" issues), and I ran into this. I pin RAPIDS 22.04.x or 22.06.x and CUDA 11.6, and at some point in the last month or so (probably when the 11.7 releases started happening), the build broke because of this.

ryan-williams commented 2 years ago

Here's an easy way to see the 11.6/11.7 mix that I end up with in the Dockerfile above (where I tried to pin 11.6):

docker run --rm runsascoded/import-cuml /opt/conda/bin/mamba list cuda
# cuda-python               11.7.1           py39h1eff087_0    conda-forge
# cudatoolkit               11.6.0              hecad31d_10    conda-forge
# dask-cuda                 22.4.0             pyhd8ed1ab_1    conda-forge
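
After rebuilding with the cuda-python=11.6.1 pin mentioned above (a hypothetical image tag, import-cuml-pinned, is used here for illustration), the same listing should show cuda-python resolved to 11.6.x rather than 11.7.1:

docker run --rm import-cuml-pinned /opt/conda/bin/mamba list cuda-python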

ryan-williams commented 2 years ago

It seems like adding cuda-python=11.6.1[build=py39*] as an additional pin is enough to work around the issue (in my Dockerfile above); a sketch of the amended install line follows.
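
Applied to that Dockerfile, the mamba install step would look roughly like this (a sketch; only the extra cuda-python pin is new, everything else is unchanged):

 && mamba install -y -c conda-forge -c rapidsai -c nvidia "cuml=${RAPIDS_VERSION}[build=cuda11_py${PYTHON_VERSION_SHORT}*]" "cudatoolkit=${CUDA_VERSION_MINOR}" "cuda-python=11.6.1[build=py${PYTHON_VERSION_SHORT}*]" \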

vyasr commented 2 years ago

Got it, that is as expected then, thanks! The issue is specifically related to cuda-python as explained in this notice, and the two solutions are either pinning cuda-python < 11.7.1 (as you found) or updating to RAPIDS 22.06.01. Our patch release conda package for 22.06 handles the necessary pinning for you.
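
In command form, the two options amount to something like the following (the exact package spellings are illustrative, not verbatim):

# Option 1: keep 22.06.00 and pin cuda-python below the breaking release
mamba install -c conda-forge "cuda-python<11.7.1"

# Option 2: update to the patched 22.06.01 packages
mamba install -c rapidsai -c nvidia -c conda-forge "rmm>=22.06.01"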