I'm getting an error: `nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.5, please update your driver to a newer version, or use an earlier cuda container: unknown`. I am on driver 535, which is an LTS branch, so I thought we wouldn't have any trouble. @trxcllnt Do you have insight on this?
`docker run -it nvidia/cuda:12.5.0-base-ubuntu22.04` works fine on this system with driver 535, so I think it is an issue with how our devcontainers are built.
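For reference, the driver version on the box can be confirmed with a standard `nvidia-smi` query:

```bash
# Print just the installed driver version, e.g. "535.161.08"
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```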
@bdice there are a number of reasons you could be seeing this, none of which we can or are going to change. I recommend installing the latest driver.
@trxcllnt This is on a lab machine where I cannot control the driver. CI and lab machines are only supposed to use LTS or Production Branch drivers, which do not yet support 12.5. We won’t be able to run 12.5 devcontainers in CI (on GPU nodes, at least) or on lab machines.
I thought the discussion we had in Slack concluded that we should not need driver updates to use 12.5 because we use LTS / PB drivers. xref: https://github.com/rapidsai/build-planning/issues/73#issuecomment-2164162911
Which machine are you seeing this on? I just ran `docker run --rm --gpus all rapidsai/devcontainers:24.08-cpp-gcc13-cuda12.5 nvidia-smi` on dgx01 w/ 535.161.08 and it worked fine.
I was on dgx05. I will try the command you gave. Maybe it’s something in how I invoked the devcontainer.
`docker run --rm --gpus all rapidsai/devcontainers:24.08-cpp-gcc13-cuda12.5 nvidia-smi` works on dgx05 for me. Hmm. Here is the full error log I get when I try to launch the devcontainer on dgx05:

Command: `devcontainer up --config .devcontainer/cuda12.5-conda/devcontainer.json --workspace-folder .`
@trxcllnt Also, can you help me debug the CI failures? I don't know what is going wrong. The pip container fails to find `cudnn`, and the conda container fails to find `gcc`. I am going to update the branch to see if these issues recur.
That looks to be failing w/ the conda container? We don't even install the CTK in the conda container; it's basically just Ubuntu + miniforge.
My guess is the nvidia-container-toolkit is seeing the `ENV CUDA_VERSION` and inferring the `NVIDIA_REQUIRE_CUDA` constraints automatically. Does it succeed if you run with `--remote-env NVIDIA_DISABLE_REQUIRE=true`?
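One way to check that theory is to look at the env vars baked into the image (standard `docker inspect` templating; the image tag is the one from above):

```bash
# Print the image's environment and filter for the CUDA-related vars the
# nvidia-container-toolkit reads at container creation
docker inspect rapidsai/devcontainers:24.08-cpp-gcc13-cuda12.5 \
  --format '{{range .Config.Env}}{{println .}}{{end}}' \
  | grep -E '^(CUDA_VERSION|NVIDIA_REQUIRE_CUDA)='
```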
The conda container is failing to create an env at all because dfg generated empty yaml files:

`Not creating 'rapids' conda environment because 'rapids.yml' is empty.`
Looks like the CUDA feature is trying to install cuDNN v8, but IIRC it's v9 now, so that's why cuDNN isn't getting installed.
> The conda container is failing to create an env at all because dfg generated empty yaml files:
Ah. I think this job should fail earlier and show the error logs from dfg. CUDA 12.5 doesn't have entries in `dependencies.yaml` for any RAPIDS repos yet. I had hoped to run CUDA 12.5 tests in unified devcontainers before opening PRs to every repo. Maybe I will start with the PRs to individual repos and come back to this repo later.
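For anyone trying to reproduce locally, something like this should show the empty output (the CLI name is `rapids-dependency-file-generator`; the exact flags and the `all` file key are assumptions based on recent versions of the tool):

```bash
pip install rapids-dependency-file-generator
# With no CUDA 12.5 entries in dependencies.yaml, the generated conda env
# file comes out empty, which is what the devcontainer then trips over
rapids-dependency-file-generator \
  --output conda \
  --file-key all \
  --matrix "cuda=12.5;arch=x86_64"
```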
> Does it succeed if you run with `--remote-env NVIDIA_DISABLE_REQUIRE=true`?
No, I get the same error as before when I run `devcontainer up --remote-env NVIDIA_DISABLE_REQUIRE=true --config .devcontainer/cuda12.5-conda/devcontainer.json --workspace-folder .`.
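A hedged guess at why the flag made no difference: `--remote-env` only applies to processes the devcontainer CLI execs in the container after it is up, while the `nvidia-container-cli` requirement check runs when the container is created, so the variable likely needs to be in the container's environment from the start (e.g. `containerEnv` in the devcontainer.json). The docker-level equivalent would be:

```bash
# Set the variable at container creation so the runtime hook sees it
docker run --rm --gpus all -e NVIDIA_DISABLE_REQUIRE=true \
  rapidsai/devcontainers:24.08-cpp-gcc13-cuda12.5 nvidia-smi
```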
> Looks like the CUDA feature is trying to install cuDNN v8, but IIRC it's v9 now, so that's why cuDNN isn't getting installed.
I updated this in d4ef78e. I wasn't sure if we wanted to keep `libcudnn8` for any CUDA versions or not. If so, let me know.
Yeah we need to install the right cuDNN version based on the CUDA toolkit. Maybe we can make the cuDNN version a feature input variable?
> Yeah we need to install the right cuDNN version based on the CUDA toolkit. Maybe we can make the cuDNN version a feature input variable?
It looks like cuDNN 9.2.0 is compatible with CUDA 11.8 and 12.0-12.5, which would cover all the devcontainers we produce. https://docs.nvidia.com/deeplearning/cudnn/latest/reference/support-matrix.html#support-matrix
Yes, but not every library works with cuDNN v9 yet (cupy, for example), so we need a variable to allow installing different versions.
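On the install-script side it could look something like this, as a sketch (the `cudnnVersion` option name is hypothetical; devcontainer features surface their options to `install.sh` as uppercase env vars, and the package names are the ones NVIDIA's apt repo uses for v8 vs. v9):

```bash
#!/usr/bin/env bash
# Hypothetical handling of a cudnnVersion feature option in the CUDA
# feature's install.sh; option name and default are assumptions
CUDNN_VERSION="${CUDNNVERSION:-9}"
CUDA_MAJOR="${CUDA_VERSION%%.*}"

if [ "${CUDNN_VERSION}" = "8" ]; then
    apt-get install -y --no-install-recommends libcudnn8 libcudnn8-dev
else
    # cuDNN 9 package names carry the CUDA major version suffix
    apt-get install -y --no-install-recommends \
        "libcudnn9-cuda-${CUDA_MAJOR}" "libcudnn9-dev-cuda-${CUDA_MAJOR}"
fi
```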
@trxcllnt I'm not sure how to add a variable. Is this something I modify in `matrix.yaml`?
Maybe I got it right? I guessed. See deba81b and d8f91e9.
/ok to test
cuDNN v9 isn't getting installed because they changed the names of the packages between 8 and 9. I'll push a commit that fixes it.
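For reference, the rename is visible from inside the image (assuming NVIDIA's CUDA apt repo is configured there):

```bash
# v8 ships as libcudnn8 / libcudnn8-dev, while v9 carries the CUDA major
# version in the name, e.g. libcudnn9-cuda-12 / libcudnn9-dev-cuda-12
apt-cache search libcudnn
```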
Do we need to install `cxx-compiler` somewhere and point CMake to it?
Seeing this on CI:
    CMake Error at /usr/share/cmake-3.30/Modules/CMakeDetermineCXXCompiler.cmake:48 (message):
      Could not find compiler set in environment variable CXX:

      /usr/bin/g++.
    Call Stack (most recent call first):
      CMakeLists.txt:24 (project)

    CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
    -- Configuring incomplete, errors occurred!
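In case it is the missing `cxx-compiler`, a sketch of what I'd expect to work in the conda image (the `rapids` env name comes from the generated files above; conda-forge's `cxx-compiler` meta-package installs a compiler into the env, and its activation scripts export `CXX`):

```bash
# Install the conda-forge compiler meta-package into the existing env
conda install -y -n rapids -c conda-forge cxx-compiler

# After `conda activate rapids`, $CXX points at the env's compiler
# (x86_64-conda-linux-gnu-c++), so CMake no longer looks for /usr/bin/g++
cmake -S . -B build
```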
This PR updates the CUDA default to 12.5 and also adds RAPIDS devcontainers for CUDA 12.5.
Part of https://github.com/rapidsai/build-planning/issues/73.