rapidsai / devcontainers


Update to CUDA 12.5. #332

Closed · bdice closed this 4 months ago

bdice commented 4 months ago

This PR updates the CUDA default to 12.5 and also adds RAPIDS devcontainers for CUDA 12.5.

Part of https://github.com/rapidsai/build-planning/issues/73.

bdice commented 4 months ago

I'm getting an error: `nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.5, please update your driver to a newer version, or use an earlier cuda container: unknown`. I am on driver 535, which is an LTS branch, so I thought we wouldn't have any trouble. @trxcllnt Do you have insight on this?

`docker run -it nvidia/cuda:12.5.0-base-ubuntu22.04` works fine on this system with driver 535, so I think it is an issue with how our devcontainers are built.
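For reference, a couple of host-side checks help narrow this down (a sketch; the inspect command simply surfaces whatever requirement string a given image carries):

```bash
# Driver version reported by the host (an R535 LTS driver on this machine).
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# The "cuda>=12.5" requirement that nvidia-container-cli enforces comes from the
# NVIDIA_REQUIRE_CUDA variable baked into an image's environment.
docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' \
  nvidia/cuda:12.5.0-base-ubuntu22.04 | grep NVIDIA_REQUIRE_CUDA
```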

trxcllnt commented 4 months ago

@bdice there are a number of reasons you could be seeing this, none of which we can or are going to change. I recommend installing the latest driver.

bdice commented 4 months ago

@trxcllnt This is on a lab machine where I cannot control the driver. CI and lab machines are only supposed to use LTS or Production Branch drivers, which do not yet support 12.5. We won’t be able to run 12.5 devcontainers in CI (on GPU nodes, at least) or on lab machines.

bdice commented 4 months ago

I thought the discussion we had in Slack concluded that we should not need driver updates to use 12.5 because we use LTS / PB drivers. xref: https://github.com/rapidsai/build-planning/issues/73#issuecomment-2164162911

trxcllnt commented 4 months ago

Which machine are you seeing this on? I just ran `docker run --rm --gpus all rapidsai/devcontainers:24.08-cpp-gcc13-cuda12.5 nvidia-smi` on dgx01 w/ 535.161.08 and it worked fine.

bdice commented 4 months ago

I was on dgx05. I will try the command you gave. Maybe it’s something in how I invoked the devcontainer.

bdice commented 4 months ago

`docker run --rm --gpus all rapidsai/devcontainers:24.08-cpp-gcc13-cuda12.5 nvidia-smi` works on dgx05 for me. Hmm. Here is the full error log I get when I try to launch the devcontainer on dgx05:

Command: `devcontainer up --config .devcontainer/cuda12.5-conda/devcontainer.json --workspace-folder .`

Error log:

```
[2024-07-01T21:57:00.173Z] @devcontainers/cli 0.54.2. Node.js v18.15.0. linux 5.4.0-182-generic x64. [2024-07-01T21:57:00.278Z] Running the initializeCommand from devcontainer.json... [2024-07-01T21:57:00.278Z] Start: Run: /bin/bash -c mkdir -m 0755 -p /raid/bdice/compose-environments/rapids1/devcontainers/../.{aws,cache,config,conda/pkgs,conda/devcontainers-cuda12.5-envs,log/devcontainer-utils} /raid/bdice/compose-environments/rapids1/devcontainers/../{rmm,kvikio,ucxx,cudf,raft,cuvs,cumlprims_mg,cuml,cugraph-ops,wholegraph,cugraph,cuspatial} [2024-07-01T21:57:00.283Z] [2024-07-01T21:57:01.403Z] Resolving Feature dependencies for './features/src/utils'... [2024-07-01T21:57:01.405Z] Resolving Feature dependencies for './features/src/rapids-build-utils'... [2024-07-01T21:57:01.472Z] Start: Run: docker buildx build --load --build-arg BUILDKIT_INLINE_CACHE=1 -f /tmp/devcontainercli-bdice/container-features/0.54.2-1719871021400/Dockerfile-with-features -t vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51 --target dev_containers_target_stage --build-arg CUDA=12.5 --build-arg PYTHON_PACKAGE_MANAGER=conda --build-arg BASE=rapidsai/devcontainers:24.08-cpp-mambaforge-ubuntu22.04 --build-context dev_containers_feature_content_source=/tmp/devcontainercli-bdice/container-features/0.54.2-1719871021400 --build-arg _DEV_CONTAINERS_BASE_IMAGE=dev_container_auto_added_stage_label --build-arg _DEV_CONTAINERS_IMAGE_USER=root --build-arg _DEV_CONTAINERS_FEATURE_CONTENT_SOURCE=dev_container_feature_content_temp /raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer [2024-07-01T21:57:01.824Z] #0 building with "default" instance using docker driver #1 [internal] load build definition from Dockerfile-with-features #1 transferring dockerfile: 10.44kB done #1 DONE 0.0s #2 resolve image config for docker-image://docker.io/docker/dockerfile:1.5 [2024-07-01T21:57:01.967Z] #2 DONE 0.3s [2024-07-01T21:57:02.077Z] #3 docker-image://docker.io/docker/dockerfile:1.5@sha256:39b85bbfa7536a5feceb7372a0817649ecb2724562a38360f4d6a7782a409b14 #3 CACHED #4 [internal] load .dockerignore [2024-07-01T21:57:02.077Z] #4 transferring context: 2B done #4 DONE 0.0s #5 [internal] load metadata for docker.io/rapidsai/devcontainers:24.08-cpp-mambaforge-ubuntu22.04 [2024-07-01T21:57:02.234Z] #5 ...
#6 [context dev_containers_feature_content_source] load .dockerignore #6 transferring dev_containers_feature_content_source: 2B done #6 DONE 0.0s [2024-07-01T21:57:02.384Z] #5 [internal] load metadata for docker.io/rapidsai/devcontainers:24.08-cpp-mambaforge-ubuntu22.04 [2024-07-01T21:57:03.578Z] #5 DONE 1.5s [2024-07-01T21:57:04.112Z] #7 [conda-base 1/1] FROM docker.io/rapidsai/devcontainers:24.08-cpp-mambaforge-ubuntu22.04@sha256:3817fe57e71da3e5667dbd860729dc5011324440e16e31a13c1b751cb71a2103 #7 DONE 0.0s #8 [context dev_containers_feature_content_source] load from client #8 transferring dev_containers_feature_content_source: 275.15kB 0.0s done #8 DONE 0.0s #9 [dev_containers_target_stage 2/5] COPY --from=dev_containers_feature_content_normalize /tmp/build-features/ /tmp/dev-container-features #9 CACHED [2024-07-01T21:57:04.112Z] #10 [dev_containers_feature_content_normalize 1/2] COPY --from=dev_containers_feature_content_source devcontainer-features.builtin.env /tmp/build-features/ #10 CACHED #11 [dev_containers_feature_content_normalize 2/2] RUN chmod -R 0755 /tmp/build-features/ #11 CACHED #12 [dev_containers_target_stage 4/5] RUN --mount=type=bind,from=dev_containers_feature_content_source,source=utils_0,target=/tmp/build-features-src/utils_0 cp -ar /tmp/build-features-src/utils_0 /tmp/dev-container-features && chmod -R 0755 /tmp/dev-container-features/utils_0 && cd /tmp/dev-container-features/utils_0 && chmod +x ./devcontainer-features-install.sh && ./devcontainer-features-install.sh && rm -rf /tmp/dev-container-features/utils_0 #12 CACHED #13 [dev_containers_target_stage 3/5] RUN echo "_CONTAINER_USER_HOME=$( (command -v getent >/dev/null 2>&1 && getent passwd 'root' || grep -E '^root|^[^:]*:[^:]*:root:' /etc/passwd || true) | cut -d: -f6)" >> /tmp/dev-container-features/devcontainer-features.builtin.env && echo "_REMOTE_USER_HOME=$( (command -v getent >/dev/null 2>&1 && getent passwd 'coder' || grep -E '^coder|^[^:]*:[^:]*:coder:' /etc/passwd || true) | cut -d: -f6)" >> /tmp/dev-container-features/devcontainer-features.builtin.env #13 CACHED #14 [dev_containers_target_stage 1/5] RUN mkdir -p /tmp/dev-container-features #14 CACHED #15 [dev_containers_target_stage 5/5] RUN --mount=type=bind,from=dev_containers_feature_content_source,source=rapids-build-utils_1,target=/tmp/build-features-src/rapids-build-utils_1 cp -ar /tmp/build-features-src/rapids-build-utils_1 /tmp/dev-container-features && chmod -R 0755 /tmp/dev-container-features/rapids-build-utils_1 && cd /tmp/dev-container-features/rapids-build-utils_1 && chmod +x ./devcontainer-features-install.sh && ./devcontainer-features-install.sh && rm -rf /tmp/dev-container-features/rapids-build-utils_1 #15 CACHED #16 exporting to image #16 exporting layers done #16 preparing layers for inline cache done #16 writing image sha256:9f663f77db74298f79e8eb1a71e24b251aab14b89f590948d6a526ec1f2949f3 done #16 naming to docker.io/library/vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51 done #16 DONE 0.0s [2024-07-01T21:57:07.334Z] Start: Run: docker run --sig-proxy=false -a STDOUT -a STDERR --mount source=/raid/bdice/compose-environments/rapids1/devcontainers,target=/home/coder/devcontainers,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../rmm,target=/home/coder/rmm,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../kvikio,target=/home/coder/kvikio,type=bind,consistency=consistent --mount 
source=/raid/bdice/compose-environments/rapids1/devcontainers/../ucxx,target=/home/coder/ucxx,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cudf,target=/home/coder/cudf,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../raft,target=/home/coder/raft,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuvs,target=/home/coder/cuvs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cumlprims_mg,target=/home/coder/cumlprims_mg,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuml,target=/home/coder/cuml,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph-ops,target=/home/coder/cugraph-ops,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../wholegraph,target=/home/coder/wholegraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph,target=/home/coder/cugraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuspatial,target=/home/coder/cuspatial,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.aws,target=/home/coder/.aws,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.cache,target=/home/coder/.cache,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.config,target=/home/coder/.config,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/pkgs,target=/home/coder/.conda/pkgs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/devcontainers-cuda12.5-envs,target=/home/coder/.conda/envs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.log/devcontainer-utils,target=/var/log/devcontainer-utils,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/utils/opt/devcontainer/bin,target=/opt/devcontainer/bin,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/rapids-build-utils/opt/rapids-build-utils,target=/opt/rapids-build-utils,type=bind,consistency=consistent -l devcontainer.local_folder=/raid/bdice/compose-environments/rapids1/devcontainers -l devcontainer.config_file=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/devcontainer.json -u root --rm --name bdice-rapids-devcontainers-24.08-cuda12.5-conda --gpus all --entrypoint /bin/sh vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51-uid -c echo Container started [2024-07-01T21:57:07.771Z] docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: requirement error: unsatisfied 
condition: cuda>=12.5, please update your driver to a newer version, or use an earlier cuda container: unknown. Error: Command failed: docker run --sig-proxy=false -a STDOUT -a STDERR --mount source=/raid/bdice/compose-environments/rapids1/devcontainers,target=/home/coder/devcontainers,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../rmm,target=/home/coder/rmm,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../kvikio,target=/home/coder/kvikio,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../ucxx,target=/home/coder/ucxx,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cudf,target=/home/coder/cudf,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../raft,target=/home/coder/raft,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuvs,target=/home/coder/cuvs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cumlprims_mg,target=/home/coder/cumlprims_mg,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuml,target=/home/coder/cuml,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph-ops,target=/home/coder/cugraph-ops,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../wholegraph,target=/home/coder/wholegraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph,target=/home/coder/cugraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuspatial,target=/home/coder/cuspatial,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.aws,target=/home/coder/.aws,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.cache,target=/home/coder/.cache,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.config,target=/home/coder/.config,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/pkgs,target=/home/coder/.conda/pkgs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/devcontainers-cuda12.5-envs,target=/home/coder/.conda/envs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.log/devcontainer-utils,target=/var/log/devcontainer-utils,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/utils/opt/devcontainer/bin,target=/opt/devcontainer/bin,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/rapids-build-utils/opt/rapids-build-utils,target=/opt/rapids-build-utils,type=bind,consistency=consistent -l devcontainer.local_folder=/raid/bdice/compose-environments/rapids1/devcontainers -l devcontainer.config_file=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/devcontainer.json -u root --rm --name 
bdice-rapids-devcontainers-24.08-cuda12.5-conda --gpus all --entrypoint /bin/sh vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51-uid -c echo Container started trap "exit 0" 15 exec "$@" while sleep 1 & wait $!; do :; done - at J$ (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:462:1253) at $J (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:462:997) at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async tAA (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:479:3660) at async CC (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:479:4775) at async NeA (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:612:11107) at async MeA (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:612:10848) {"outcome":"error","message":"Command failed: docker run --sig-proxy=false -a STDOUT -a STDERR --mount source=/raid/bdice/compose-environments/rapids1/devcontainers,target=/home/coder/devcontainers,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../rmm,target=/home/coder/rmm,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../kvikio,target=/home/coder/kvikio,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../ucxx,target=/home/coder/ucxx,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cudf,target=/home/coder/cudf,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../raft,target=/home/coder/raft,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuvs,target=/home/coder/cuvs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cumlprims_mg,target=/home/coder/cumlprims_mg,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuml,target=/home/coder/cuml,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph-ops,target=/home/coder/cugraph-ops,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../wholegraph,target=/home/coder/wholegraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph,target=/home/coder/cugraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuspatial,target=/home/coder/cuspatial,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.aws,target=/home/coder/.aws,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.cache,target=/home/coder/.cache,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.config,target=/home/coder/.config,type=bind,consistency=consistent --mount 
source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/pkgs,target=/home/coder/.conda/pkgs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/devcontainers-cuda12.5-envs,target=/home/coder/.conda/envs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.log/devcontainer-utils,target=/var/log/devcontainer-utils,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/utils/opt/devcontainer/bin,target=/opt/devcontainer/bin,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/rapids-build-utils/opt/rapids-build-utils,target=/opt/rapids-build-utils,type=bind,consistency=consistent -l devcontainer.local_folder=/raid/bdice/compose-environments/rapids1/devcontainers -l devcontainer.config_file=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/devcontainer.json -u root --rm --name bdice-rapids-devcontainers-24.08-cuda12.5-conda --gpus all --entrypoint /bin/sh vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51-uid -c echo Container started\ntrap \"exit 0\" 15\n\nexec \"$@\"\nwhile sleep 1 & wait $!; do :; done -","description":"An error occurred setting up the container."}
```

bdice commented 4 months ago

@trxcllnt Also, can you help me debug the CI failures? I don't know what is going wrong. The pip container fails to find cudnn and the conda container fails to find gcc. I am going to update the branch to see if these issues reoccur.

trxcllnt commented 4 months ago

That looks to be failing with the conda container? We don't even install the CTK in the conda container; it's basically just Ubuntu + miniforge.

My guess is the nvidia-container-toolkit is seeing the `ENV CUDA_VERSION` and inferring the `NVIDIA_REQUIRE_CUDA` constraints automatically.

Does it succeed if you run with `--remote-env NVIDIA_DISABLE_REQUIRE=true`?
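One way to test that hypothesis outside of the devcontainer CLI is to pass the variable directly at container creation (a sketch; the image tag is the `vsc-devcontainers-...` image from the build log above, and `--entrypoint` just overrides whatever the devcontainer build set):

```bash
# NVIDIA_DISABLE_REQUIRE=true tells the nvidia container runtime hook to skip
# the cuda>=12.5 requirement check; if the container starts, the inferred
# NVIDIA_REQUIRE_CUDA constraint is what is blocking the devcontainer.
docker run --rm --gpus all \
  -e NVIDIA_DISABLE_REQUIRE=true \
  --entrypoint nvidia-smi \
  vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51-uid
```

If that works, setting the variable via `containerEnv` in devcontainer.json may be the equivalent fix, since `--remote-env` only affects processes started after the container is created, not the `docker run` that the hook rejects.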

trxcllnt commented 4 months ago

The conda container is failing to create an env at all because dfg generated empty YAML files:

    Not creating 'rapids' conda environment because 'rapids.yml' is empty.

trxcllnt commented 4 months ago

Looks like the CUDA feature is trying to install cuDNN v8, but IIRC it's v9 now, so that's why cuDNN isn't getting installed.

bdice commented 4 months ago

> The conda container is failing to create an env at all because dfg generated empty YAML files:

Ah. I think this job should fail earlier and show the error logs from dfg. CUDA 12.5 doesn't have entries in dependencies.yaml for any RAPIDS repos yet. I had hoped to run CUDA 12.5 tests in unified devcontainers before opening PRs to every repo. Maybe I will start with the PRs to individual repos and come back to this repo later.
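For the fail-early idea, something along these lines in the environment-creation step would surface the problem immediately (a rough sketch; `rapids.yml` is the file named in the log message above, and where exactly this check belongs is an assumption):

```bash
# Abort instead of silently skipping env creation when
# rapids-dependency-file-generator emits an empty file.
env_yaml="rapids.yml"
if [ ! -s "${env_yaml}" ] || ! grep -q '^dependencies:' "${env_yaml}"; then
  echo "error: '${env_yaml}' is empty or has no dependencies;" \
       "check dependencies.yaml for a matching CUDA matrix entry" >&2
  exit 1
fi
```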

bdice commented 4 months ago

> Does it succeed if you run with `--remote-env NVIDIA_DISABLE_REQUIRE=true`?

No, I get the same error when I run `devcontainer up --remote-env NVIDIA_DISABLE_REQUIRE=true --config .devcontainer/cuda12.5-conda/devcontainer.json --workspace-folder .` as before.

bdice commented 4 months ago

> Looks like the CUDA feature is trying to install cuDNN v8, but IIRC it's v9 now, so that's why cuDNN isn't getting installed.

I updated this in d4ef78e. I wasn't sure if we wanted to keep libcudnn8 for any CUDA versions or not. If so, let me know.

trxcllnt commented 4 months ago

Yeah we need to install the right cuDNN version based on the CUDA toolkit. Maybe we can make the cuDNN version a feature input variable?

bdice commented 4 months ago

> Yeah we need to install the right cuDNN version based on the CUDA toolkit. Maybe we can make the cuDNN version a feature input variable?

It looks like cuDNN 9.2.0 is compatible with CUDA 11.8 and 12.0-12.5, which would cover all the devcontainers we produce. https://docs.nvidia.com/deeplearning/cudnn/latest/reference/support-matrix.html#support-matrix

trxcllnt commented 4 months ago

Yes but not every library works with cuDNN v9 yet (cupy, for example), so we need a variable to allow installing different versions.
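A rough sketch of what that could look like in the feature's install script (all names here are hypothetical; devcontainer feature options are exposed to `install.sh` as upper-cased environment variables, and the package names assume NVIDIA's apt repository layout):

```bash
#!/usr/bin/env bash
# Hypothetical handling of a "cudnnVersion" feature option.
# The devcontainer spec surfaces option values as upper-cased env vars,
# so a "cudnnVersion" option would arrive as CUDNNVERSION.
CUDNN_MAJOR="${CUDNNVERSION:-9}"
CUDA_MAJOR="${CUDA_VERSION%%.*}"   # assumes CUDA_VERSION is set in the image

if [ "${CUDNN_MAJOR}" = "8" ]; then
  apt-get install -y --no-install-recommends libcudnn8 libcudnn8-dev
else
  # cuDNN 9 packages are scoped to a CUDA major version.
  apt-get install -y --no-install-recommends \
    "libcudnn9-cuda-${CUDA_MAJOR}" "libcudnn9-dev-cuda-${CUDA_MAJOR}"
fi
```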

bdice commented 4 months ago

@trxcllnt I'm not sure how to add a variable. Is this something I modify in matrix.yaml?

bdice commented 4 months ago

Maybe I got it right? I guessed. See deba81b and d8f91e9.

trxcllnt commented 4 months ago

/ok to test

trxcllnt commented 4 months ago

cuDNN v9 isn't getting installed because they changed the names of the packages between 8 and 9. I'll push a commit that fixes it.
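For anyone hitting the same thing, the rename is easy to see from inside a CUDA 12.x image with the NVIDIA apt repository configured (a sketch; the output depends on the repository contents):

```bash
# cuDNN 8 shipped as libcudnn8 / libcudnn8-dev; cuDNN 9 scopes the packages to a
# CUDA major version, e.g. libcudnn9-cuda-12 / libcudnn9-dev-cuda-12.
apt-get update -qq
apt-cache search --names-only 'libcudnn'
```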

jakirkham commented 4 months ago

Do we need to install cxx-compiler somewhere and point CMake to it?

Seeing this on CI:

    CMake Error at /usr/share/cmake-3.30/Modules/CMakeDetermineCXXCompiler.cmake:48 (message):
      Could not find compiler set in environment variable CXX:

      /usr/bin/g++.

    Call Stack (most recent call first):
      CMakeLists.txt:24 (project)

    CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
    -- Configuring incomplete, errors occurred!

trxcllnt commented 4 months ago

No, the problem is that there are no matrix entries for CUDA 12.5 in dependencies.yaml (e.g. here), causing rapids-dependency-file-generator to output an empty conda environment YAML file, so nothing gets installed.
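A quick way to confirm that in the affected repository (a sketch; the grep just looks for a CUDA 12.5 matrix entry, and running `rapids-dependency-file-generator` with no arguments regenerates the checked-in files once the entries exist):

```bash
# No output here means there is no matrix entry for
# rapids-dependency-file-generator to match, so it emits an empty environment file.
grep -n 'cuda: "12\.5"' dependencies.yaml

# Regenerate the environment files after adding the missing 12.5 entries.
rapids-dependency-file-generator
```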