rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[QST] Why do I get a ModuleNotFoundError? No module named 'rmm._cuda.stream' #15230

Closed by blue-cat-whale 6 months ago

blue-cat-whale commented 6 months ago

[Resolved] Removing and reinstalling the CUDA toolkit solved the problem.

In short, cudf runs well in one environment but throws an exception in another, even though the two environments have identical CUDA-related package versions. I have two machines: one runs Ubuntu 22.04 and the other runs RHEL 8.9. Both have CUDA 12 installed. Running python3 -m cudf.pandas my_code.py works on the Ubuntu machine but fails on the RHEL 8.9 machine. The Ubuntu environment is defined as follows:

FROM nvidia/cuda:12.0.1-devel-ubuntu22.04
RUN apt-get update && apt-get install -y wget && apt-get install curl -y && apt-get install python3-pip -y
ENV PATH=$PATH:~/.local/bin:~/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
RUN pip install --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple "cudf-cu12>=24.4.0a0,<=24.4" "dask-cudf-cu12>=24.4.0a0,<=24.4" "cuml-cu12>=24.4.0a0,<=24.4" "cugraph-cu12>=24.4.0a0,<=24.4" "dask-cuda>=24.4.0a0,<=24.4"
RUN pip install numpy==1.26.4 pandas==2.2.1 Cython==3.0.8 scikit-learn==1.4.0 swifter==1.4.0 requests==2.31.0 numba==0.59.0 scikit-learn-intelex==2024.1.0

As pic1 and pic2 show, the CUDA-related packages in the two environments have identical version numbers. But when I run the same command on the RHEL 8.9 machine, it returns this error:

Traceback (most recent call last):
  File "/usr/lib64/python3.9/runpy.py", line 188, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib64/python3.9/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/usr/local/lib64/python3.9/site-packages/cudf/__init__.py", line 9, in <module>
    _setup_numba()
  File "/usr/local/lib64/python3.9/site-packages/cudf/utils/_numba.py", line 124, in _setup_numba
    _get_cc_60_ptx_file()
  File "/usr/local/lib64/python3.9/site-packages/cudf/utils/_numba.py", line 16, in _get_cc_60_ptx_file
    from cudf._lib import strings_udf
  File "/usr/local/lib64/python3.9/site-packages/cudf/_lib/__init__.py", line 4, in <module>
    from . import (
  File "avro.pyx", line 1, in init cudf._lib.avro
ModuleNotFoundError: No module named 'rmm._cuda.stream'

I have installed rmm-cu12 (see pic2 and pic3 below), and the two environments have identical versions. I tried removing and reinstalling rmm and cudf, but the problem persists. My environment variables are:

PATH=/usr/bin:/home/<user_name>/.local/bin:/home/<user_name>/bin:/usr/condabin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
PYTHONPATH=/usr/local/lib64/python3.9/site-packages

pic1

pic2

pic3

bdice commented 6 months ago

Thanks for the question. I am not sure what the root cause is for this. Can you try running the following on the RHEL 8 machine?

import rmm
print(rmm.__file__)
print(rmm.__version__)
print(dir(rmm))

I will try to set up a reproducer on Rocky 8 later today (the closest OS to RHEL 8.9).

Also, I see your path has names like condabin. Have you tried conda instead of pip?
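
The rmm check above could be extended along these lines. This is only a sketch, not part of the original reply: find_package_dirs and submodule_available are hypothetical helper names. It assumes a stale or partial copy of rmm on sys.path (for example via PYTHONPATH) might be shadowing the intended install, which this thread does not confirm.

```python
import importlib.util
import os
import sys


def find_package_dirs(package, paths=None):
    """Return each directory named `package` found directly under an entry of `paths`."""
    hits = []
    for entry in (sys.path if paths is None else paths):
        # An empty sys.path entry means the current working directory.
        candidate = os.path.join(entry or os.getcwd(), package)
        if os.path.isdir(candidate) and candidate not in hits:
            hits.append(candidate)
    return hits


def submodule_available(name):
    """Report whether `name` can be located.

    find_spec imports the parent packages of a dotted name, but not the
    final submodule itself.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package surfaces as ModuleNotFoundError here.
        return False


if __name__ == "__main__":
    print("rmm directories:", find_package_dirs("rmm"))
    print("rmm._cuda.stream found:", submodule_available("rmm._cuda.stream"))
```

More than one directory in the first line of output would suggest that two copies of rmm are visible to the interpreter. A False on the second line would narrow the problem to the rmm install itself rather than to cudf.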

blue-cat-whale commented 6 months ago

It turns out that there was a version conflict in the underlying CUDA toolkit. The problem is gone after reinstallation.
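
For anyone hitting something similar, a rough way to look for more than one system-level toolkit is sketched below. The thread does not say which versions conflicted or where they were installed, so this is an assumption-laden sketch: it only looks under /usr/local, the default location used by NVIDIA's Linux installers, and the helper names find_toolkit_dirs and resolve_default_toolkit are hypothetical.

```python
import glob
import os
import shutil


def find_toolkit_dirs(root="/usr/local"):
    """Return versioned CUDA toolkit directories (cuda-X.Y) under `root`, sorted."""
    pattern = os.path.join(root, "cuda-*")
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def resolve_default_toolkit(root="/usr/local"):
    """Return the real path behind `root`/cuda, or None if it does not exist."""
    link = os.path.join(root, "cuda")
    return os.path.realpath(link) if os.path.exists(link) else None


if __name__ == "__main__":
    print("toolkit directories:", find_toolkit_dirs())
    print("default toolkit:", resolve_default_toolkit())
    # shutil.which reports the nvcc that PATH would actually pick up, if any.
    print("nvcc on PATH:", shutil.which("nvcc"))
```

Several versioned directories, or an nvcc path that does not sit under the default toolkit, would be a hint that toolkits are mixed. Toolkits installed elsewhere (conda environments, custom prefixes) will not show up here.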