Open bdice opened 1 month ago
@ajschmidt8 - Please take a look at the possibility of updating this in July.
I’ve spent a bit of time investigating and discussing this topic with others (including @jrhemstad). I’m coming to the conclusion that we may not need any driver updates, because the Production Branch driver we currently use (R550) is supported for CUDA Forward Compatibility with 12.5+ according to this table: https://docs.nvidia.com/deploy/cuda-compatibility/#id3
My local tests on a machine with R535, the LTS driver, also indicate compatibility should be fine. The key here is for us to remain on only Production Branch or LTS Branch drivers!
I propose a change to this plan: we should try to use CUDA 12.5 to build and test for a couple repos (rmm and cudf) and if it works we can update to 12.5 instead of 12.4. I will file PRs to the miniforge-cuda, ci-imgs, and shared-workflows repositories to enable this test.
It is worth noting that CUDA 12.5.1 packages (in various formats) are now out.
Also, pynvjitlink has been rebuilt with CUDA 12.5.1: https://github.com/rapidsai/pynvjitlink/pull/95
Looking at this one...
- [ ] Update miniforge-cuda to 12.5.1
- Depends on https://gitlab.com/nvidia/container-images/cuda being updated to 12.5.1
It appears the builds are already pulling in the latest distro packages for CUDA 12.5.1.
For example, this job from yesterday shows the following:
#13 4.389 cuda-compat-12-5 x86_64 1:555.42.06-1 cuda 38 M
#13 4.389 cuda-cudart-12-5 x86_64 12.5.82-1 cuda 226 k
#13 4.389 cuda-toolkit-12-5-config-common noarch 12.5.82-1 cuda 7.7 k
#13 4.389 cuda-toolkit-12-config-common noarch 12.5.82-1 cuda 7.9 k
#13 4.389 cuda-toolkit-config-common noarch 12.5.82-1 cuda 7.9 k
Note that these match the new versions in CUDA 12.5U1
So looks like this is done already
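For anyone who wants to repeat this check on a newer job log, here is a minimal sketch that pulls the package names and versions out of the lines quoted above. The regular expression assumes the column layout seen in that excerpt (build step, timestamp, package, arch, version, repo, size); the helper names `parse_packages` and `toolkit_versions` are hypothetical and not part of any RAPIDS tooling.

```python
import re

# Log lines copied from the job output quoted above.
LOG = """\
#13 4.389 cuda-compat-12-5 x86_64 1:555.42.06-1 cuda 38 M
#13 4.389 cuda-cudart-12-5 x86_64 12.5.82-1 cuda 226 k
#13 4.389 cuda-toolkit-12-5-config-common noarch 12.5.82-1 cuda 7.7 k
#13 4.389 cuda-toolkit-12-config-common noarch 12.5.82-1 cuda 7.9 k
#13 4.389 cuda-toolkit-config-common noarch 12.5.82-1 cuda 7.9 k
"""

# Assumed layout: "#<step> <seconds> <package> <arch> <version> <repo> <size>"
LINE_RE = re.compile(
    r"^#\d+\s+[\d.]+\s+(?P<name>\S+)\s+(?P<arch>\S+)\s+(?P<version>\S+)\s+(?P<repo>\S+)\s+"
)


def parse_packages(log):
    """Map package name -> version string for every line that matches the layout."""
    packages = {}
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match:
            packages[match["name"]] = match["version"]
    return packages


def toolkit_versions(packages):
    """Distinct versions of the toolkit packages (everything except cuda-compat)."""
    return {
        version
        for name, version in packages.items()
        if not name.startswith("cuda-compat")
    }


packages = parse_packages(LOG)
print(packages["cuda-compat-12-5"])  # 1:555.42.06-1
print(toolkit_versions(packages))    # {'12.5.82-1'}
```

A single toolkit version in the result means every toolkit package in the excerpt came from the same release.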
Though it would be nice to update this miniforge-cuda line to 12.5.1, it doesn't seem to be a blocker.
Edit: Also, it looks like the ci-imgs were rebuilt more recently, so they are already using these images that have the 12.5.1 packages.
Given this, I will go ahead and check these boxes.
cc @KyleFromNVIDIA (as we discussed this offline)
NVIDIA CUDA 12.5.1 images were released recently, so Kyle and I have updated the RAPIDS images to use them.
Added a few notes in the OP. More details in the linked PRs
After all the libraries have been updated, I think we'll also want to add rapidsai/base and rapidsai/notebooks images, similar to https://github.com/rapidsai/docker/pull/634. Is that tracked anywhere?
I don't know enough about the resource constraints on Docker Hub to say with confidence whether that change should look like just "add CUDA 12.5 images" or should be "stop publishing CUDA 12.2 images and start publishing CUDA 12.5 images". Maybe @raydouglass or @ajschmidt8 could comment on that.
You can see in https://hub.docker.com/r/rapidsai/notebooks/tags that the tagging scheme has -cuda{major}.{minor} in it.
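As a rough illustration of that scheme, the sketch below extracts the CUDA version from a tag string. Only the `-cuda{major}.{minor}` component comes from the tag list linked above; the example tags and the function name `cuda_version_from_tag` are made up for illustration and may not correspond to published images.

```python
import re

# Matches the "-cuda{major}.{minor}" component of a tag.
CUDA_TAG_RE = re.compile(r"-cuda(?P<major>\d+)\.(?P<minor>\d+)")


def cuda_version_from_tag(tag):
    """Return (major, minor) if the tag carries a CUDA component, else None."""
    match = CUDA_TAG_RE.search(tag)
    if match is None:
        return None
    return int(match["major"]), int(match["minor"])


# Hypothetical tags, used only to exercise the parser.
example_tags = ["24.08-cuda12.5-py3.11", "24.08-cuda12.2-py3.11", "latest"]
for tag in example_tags:
    print(tag, cuda_version_from_tag(tag))
```

Grouping tags by the returned tuple would show how many CUDA minor versions are being published at once, which is the question behind "add 12.5" versus "replace 12.2 with 12.5".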
We noticed the .devcontainer contents were updated for CUDA 12.5, but the .devcontainer path names were not. Submitted a few PRs to clean up the ones that were already merged, and pushed changes to the open PRs to fix this. Made a note for us to update our PR generation scripts to address this going forward.
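One way the PR generation scripts could handle the path names is sketched below. It assumes the devcontainer directories are named like `.devcontainer/cuda12.2-conda`, which is an assumption about the repository layout rather than something stated in this thread; `bump_devcontainer_path` is a hypothetical helper, and it only computes the new path without touching the filesystem.

```python
from pathlib import PurePosixPath


def bump_devcontainer_path(path, old="12.2", new="12.5"):
    """Rewrite a 'cuda<old>-*' directory name directly under .devcontainer.

    Paths outside .devcontainer are returned unchanged.
    """
    parts = list(PurePosixPath(path).parts)
    prefix = f"cuda{old}-"
    for i in range(1, len(parts)):
        if parts[i - 1] == ".devcontainer" and parts[i].startswith(prefix):
            parts[i] = f"cuda{new}-" + parts[i][len(prefix):]
    return str(PurePosixPath(*parts))


# Assumed layout, for illustration only.
print(bump_devcontainer_path(".devcontainer/cuda12.2-conda/devcontainer.json"))
print(bump_devcontainer_path("docs/cuda12.2-notes.md"))
```

Restricting the rename to entries directly under `.devcontainer` keeps unrelated files that happen to mention 12.2 from being renamed.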
Just an update here: we have gotten the bulk of RAPIDS projects onto CUDA 12.5.
One exception is cuSpatial, where we are noticing issues with the notebook tests. This occurs without CUDA 12.5 as well. James has been investigating in a PR (https://github.com/rapidsai/cuspatial/pull/1407) and will be following up with the cuSpatial team on next steps.
I'm proposing switching to CUDA 12.5 images + Python 3.11 in the docs at https://docs.rapids.ai/deployment
With the recent CCCL update (https://github.com/rapidsai/rapids-cmake/pull/607), we should now be able to build RAPIDS with CUDA versions 12.5 and older.
We have CUDA driver R550 in CI now, ~which only supports up to CUDA 12.4, so that's the latest version we could adequately test. CUDA 12.5 needs driver R555, which does not yet have a production branch (PB) or long-term support (LTS) release.~
edit: R550 is a Production Branch driver, and therefore supports CUDA Forward Compatibility with CUDA 12.5 containers. This means we are able to support CUDA 12.5 (the latest version at the time of writing).
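The reasoning in the edit above can be sketched as a small decision function. This is a simplification under stated assumptions: it treats forward compatibility as a per-branch yes/no flag, whereas the linked NVIDIA table is the authority on which driver and CUDA pairs are actually supported. The R550 values come from this issue (12.4 natively, Production Branch); the `Driver` class, the `support_mode` function, and the `"R000"` entry are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Driver:
    branch: str
    native_cuda: tuple    # newest CUDA (major, minor) usable without forward compatibility
    forward_compat: bool  # assumed True for Production Branch and LTS Branch drivers


def support_mode(driver, container_cuda):
    """Classify how a container's CUDA version would run on the given driver."""
    if container_cuda <= driver.native_cuda:
        return "native"
    if driver.forward_compat:
        return "forward-compat"
    return "unsupported"


# Values taken from the discussion in this issue.
r550 = Driver(branch="R550", native_cuda=(12, 4), forward_compat=True)
# Hypothetical driver without forward compatibility, for contrast.
other = Driver(branch="R000", native_cuda=(12, 4), forward_compat=False)

print(support_mode(r550, (12, 4)))   # native
print(support_mode(r550, (12, 5)))   # forward-compat
print(support_mode(other, (12, 5)))  # unsupported
```

The struck-through text corresponds to the "unsupported" branch; the edit corresponds to the "forward-compat" branch, which is why staying on Production Branch or LTS Branch drivers matters.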
I propose to update CI images, shared workflows, devcontainers, etc. to replace CUDA 12.2 with CUDA 12.5. We would retain CI testing for CUDA 12.0 as the lower bound of 12.x. ~This will also align with PyTorch's upcoming CUDA 12.4 support (there have been a series of PRs adding CUDA 12.4 support like https://github.com/pytorch/builder/pull/1720).~ edit: We will upgrade to the latest CUDA, 12.5, instead of 12.4. I will separately address the CUDA compatibility questions between RAPIDS and PyTorch by working on our docs and release selector (see also: https://github.com/rapidsai/build-infra/issues/55).
Tasks
We can start this work now (not blocked by 12.5.1 updates above):
- cuda-version matrix entry for 12.5
- .github/workflows/ to use shared-workflows branch
- matrix_filter entries using 12.2 to 12.5
- rapidsai/docker (https://github.com/rapidsai/docker/pull/689)
Once all repos are migrated, merge the shared-workflows PR and then revert to the current default shared-workflows branch.
Docs changes (wait until all repos are migrated):