rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Wrong cuda version is derived, causing "No module named 'pynvjitlink'" #6018

Open maxiuw opened 3 months ago

maxiuw commented 3 months ago

Describe the bug

Even though I installed everything with the cu11 packages, importing cuml raises an error:

line 137, in _setup_numba
    from pynvjitlink.patch import patch_numba_linker
ModuleNotFoundError: No module named 'pynvjitlink'

This is caused by cudf/utils/_numba.py, line 137, where the driver/CUDA version is checked. I am not sure why, but it reports version 12. The obvious fix would be to install pynvjitlink, but that package has no cu11 build. As a workaround I commented out lines 136-139, but that is not a good solution for non-local deployment.
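To see which of the Numba linker helper packages are actually importable in the active environment, a check along the following lines may help. This is only a diagnostic sketch: `installed_linker_modules` is a hypothetical helper, and the import names `ptxcompiler` and `cubinlinker` are assumed to correspond to the `ptxcompiler-cu11` and `cubinlinker-cu11` packages in the pip list below.

```python
import importlib.util


def installed_linker_modules(names=("pynvjitlink", "ptxcompiler", "cubinlinker")):
    """Map each module name to True if it can be found in this environment.

    The default names are assumptions about the import names of the
    CUDA 12 helper (pynvjitlink) and the CUDA 11 helpers.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in installed_linker_modules().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

In a cu11 environment one would expect `pynvjitlink` to be reported as missing, which matches the ModuleNotFoundError above.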

Steps/Code to reproduce bug

import cuml

error in cudf/utils/_numba.py, line 137, in _setup_numba
    from pynvjitlink.patch import patch_numba_linker
ModuleNotFoundError: No module named 'pynvjitlink'

Expected behavior


>>> pip list | grep 'cu'   

cubinlinker-cu11          0.3.0.post2
cuda-python               11.8.3
cudf-cu11                 24.6.1
cuml-cu11                 24.6.1
cupy-cuda11x              13.2.0
dask-cuda                 24.6.0
dask-cudf-cu11            24.6.1
distributed-ucxx-cu11     0.38.0
executing                 2.0.1
libucx-cu11               1.15.0.post1
ptxcompiler-cu11          0.8.1.post1
pylibraft-cu11            24.6.0
raft-dask-cu11            24.6.0
rmm-cu11                  24.6.0
torch                     2.1.0+cu118
torchaudio                2.1.0+cu118
torchvision               0.16.0+cu118
ucx-py-cu11               0.38.0
ucxx-cu11                 0.38.0
>>> nvcc -V   
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
>>> from cudf.utils._ptxcompiler import NO_DRIVER, safe_get_versions
>>> safe_get_versions()
((12, 0), (12, 1))

dantegd commented 3 months ago

This error is coming from the cuDF side of things. Maybe @galipremsagar, @vyasr, or @divyegala can give some insight here.

galipremsagar commented 3 months ago

cc: @brandon-b-miller curious if you know why this might be happening.

brandon-b-miller commented 3 months ago

pynvjitlink is indeed a CUDA 12-specific requirement. However, cuDF shouldn't attempt to find it unless it detects that it is in a CUDA 12 environment. The question is why you are getting ((12, 0), (12, 1)) from safe_get_versions in what is apparently a CUDA 11 environment. Ultimately the information is obtained from cuDriverGetVersion, so it's finding CUDA 12 somewhere.
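For reference, the driver's answer can be queried directly, independent of cuDF. The sketch below calls cuDriverGetVersion through ctypes and decodes the result, which the CUDA driver API encodes as 1000 * major + 10 * minor. The helper names and the Linux library name `libcuda.so.1` are assumptions for illustration. Note that this value reflects the newest CUDA version the installed driver supports, which can be higher than the toolkit version printed by `nvcc -V`.

```python
import ctypes


def decode_cuda_version(raw):
    """Split CUDA's integer encoding (1000 * major + 10 * minor) into a tuple."""
    return raw // 1000, (raw % 1000) // 10


def driver_cuda_version(libname="libcuda.so.1"):
    """Return (major, minor) from cuDriverGetVersion, or None on failure.

    The default library name assumes Linux; no driver library being
    loadable (e.g. on a machine without an NVIDIA driver) yields None.
    """
    try:
        libcuda = ctypes.CDLL(libname)
        raw = ctypes.c_int(0)
        status = libcuda.cuDriverGetVersion(ctypes.byref(raw))
    except (OSError, AttributeError):
        return None
    if status != 0:  # any nonzero CUresult is an error
        return None
    return decode_cuda_version(raw.value)


if __name__ == "__main__":
    print("Driver-reported CUDA version:", driver_cuda_version())
```

If this prints a 12.x version on a machine whose toolkit is 11.8, the driver itself is the source of the CUDA 12 detection.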

@maxiuw can you provide some details on how you constructed the environment and installed things?

maxiuw commented 3 months ago

Sure, what information do you need? I use a conda env and install everything inside it. Do you need a list of packages?

brandon-b-miller commented 3 months ago

Thanks @maxiuw. Starting from the base environment (no conda env yet), can you provide the output of nvidia-smi? Then, would you be able to share the steps you used to create the environment that contains cuML? For instance, did you install using the command line instructions from https://docs.rapids.ai/install, or are you creating your conda environment by some other means?
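One way to gather the requested command output in a single pass is a small script like the one below. It is a sketch only: `run_if_available` and `collect_report` are hypothetical helper names, and the script simply skips any tool that is not on PATH.

```python
import shutil
import subprocess


def run_if_available(cmd):
    """Return a command's stdout, or None when its executable is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


def collect_report(commands=(("nvidia-smi",), ("nvcc", "-V"))):
    """Run each diagnostic command and return a dict of its output (or None)."""
    return {" ".join(cmd): run_if_available(list(cmd)) for cmd in commands}


if __name__ == "__main__":
    for name, output in collect_report().items():
        print(f"--- {name} ---")
        print(output if output is not None else "(not found on PATH)")
```

Running it once in the base environment and once inside the conda env makes it easy to compare what each one sees.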