rapidsai / build-planning

Tracking for RAPIDS-wide build tasks
https://github.com/rapidsai

Update latest CUDA version for build/test to 12.5 #73

Open bdice opened 1 month ago

bdice commented 1 month ago

With the recent CCCL update (https://github.com/rapidsai/rapids-cmake/pull/607), we should now be able to build RAPIDS with CUDA versions 12.5 and older.

We have CUDA driver R550 in CI now, ~which only supports up to CUDA 12.4, so that's the latest version we could adequately test. CUDA 12.5 needs driver R555, which does not yet have a production branch (PB) or long-term support (LTS) release.~

edit: R550 is a Production Branch driver, and therefore supports CUDA Forward Compatibility with CUDA 12.5 containers. This means we are able to support CUDA 12.5 (the latest version at the time of writing).

I propose to update CI images, shared workflows, devcontainers, etc. to replace CUDA 12.2 with CUDA 12.5. We would retain CI testing for CUDA 12.0 as the lower bound of 12.x. ~This will also align with PyTorch's upcoming CUDA 12.4 support (there have been a series of PRs adding CUDA 12.4 support like https://github.com/pytorch/builder/pull/1720).~ edit: We will upgrade to the latest CUDA, 12.5, instead of 12.4. I will separately address questions about CUDA compatibility between RAPIDS and PyTorch by working on our docs and release selector (see also: https://github.com/rapidsai/build-infra/issues/55).

Tasks

We can start this work now (not blocked by 12.5.1 updates above):

Once all repos are migrated, merge the shared-workflows PR and then revert to the current default shared-workflows branch.

Docs changes (wait until all repos are migrated):

mmccarty commented 1 month ago

@ajschmidt8 - Please take a look at the possibility of updating this in July.

bdice commented 1 month ago

I’ve spent a bit of time investigating and discussing this topic with others (including @jrhemstad). I’m coming to the conclusion that we may not need any driver updates, because the Production Branch driver we currently use (R550) is supported for CUDA Forward Compatibility with 12.5+ according to this table: https://docs.nvidia.com/deploy/cuda-compatibility/#id3

My local tests on a machine with R535, the LTS driver, also indicate compatibility should be fine. The key here is for us to remain on only Production Branch or LTS Branch drivers!
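The reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `how_supported` is hypothetical, the R550 and R555 entries come from this thread, the R535 entry (native CUDA 12.2) is an added assumption not stated here, and the sketch does not model any upper bound on Forward Compatibility. The compatibility table linked above remains the authority.

```python
# Hypothetical helper; a simplified model, not an official compatibility table.
# Driver branch -> newest CUDA version it supports natively.
# R550 -> 12.4 and R555 -> 12.5 are taken from this thread;
# R535 -> 12.2 is an added assumption.
NATIVE_CUDA = {535: (12, 2), 550: (12, 4), 555: (12, 5)}

# Branches described in this thread as LTS (R535) or Production Branch (R550).
# Only these are treated as eligible for CUDA Forward Compatibility.
FORWARD_COMPAT_BRANCHES = {535, 550}


def how_supported(driver_branch, cuda_version):
    """Classify how a (major, minor) CUDA version would run on a driver branch."""
    native = NATIVE_CUDA.get(driver_branch)
    if native is not None and cuda_version <= native:
        return "native"
    if driver_branch in FORWARD_COMPAT_BRANCHES:
        return "forward-compat"
    return "unsupported"


if __name__ == "__main__":
    for branch in (535, 550, 555):
        print(f"R{branch} with CUDA 12.5:", how_supported(branch, (12, 5)))
```

Under these assumptions, a CUDA 12.5 container on R550 or R535 falls into the forward-compat case, which is why no driver update would be needed.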

I propose a change to this plan: we should try using CUDA 12.5 to build and test a couple of repos (rmm and cudf) and, if it works, we can update to 12.5 instead of 12.4. I will file PRs to the miniforge-cuda, ci-imgs, and shared-workflows repositories to enable this test.

jakirkham commented 3 weeks ago

It is worth noting that CUDA 12.5.1 packages (in various formats) are now out

Also pynvjitlink has been rebuilt with CUDA 12.5.1: https://github.com/rapidsai/pynvjitlink/pull/95

jakirkham commented 3 weeks ago

Looking at this one...

It appears the builds are already pulling in the latest distro packages for CUDA 12.5.1

For example, this job from yesterday shows the following:

#13 4.389  cuda-compat-12-5                x86_64 1:555.42.06-1           cuda       38 M
#13 4.389  cuda-cudart-12-5                x86_64 12.5.82-1               cuda      226 k
#13 4.389  cuda-toolkit-12-5-config-common noarch 12.5.82-1               cuda      7.7 k
#13 4.389  cuda-toolkit-12-config-common   noarch 12.5.82-1               cuda      7.9 k
#13 4.389  cuda-toolkit-config-common      noarch 12.5.82-1               cuda      7.9 k

Note that these match the new versions in CUDA 12.5U1
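A check like this can be scripted. The following is a minimal sketch, assuming the log lines keep the layout of the excerpt above (a build-step prefix, a timestamp, then package name, architecture, and version columns); the function name `parse_packages` is hypothetical.

```python
# Log excerpt copied from the job output quoted above.
LOG = """\
#13 4.389  cuda-compat-12-5                x86_64 1:555.42.06-1           cuda       38 M
#13 4.389  cuda-cudart-12-5                x86_64 12.5.82-1               cuda      226 k
#13 4.389  cuda-toolkit-12-5-config-common noarch 12.5.82-1               cuda      7.7 k
#13 4.389  cuda-toolkit-12-config-common   noarch 12.5.82-1               cuda      7.9 k
#13 4.389  cuda-toolkit-config-common      noarch 12.5.82-1               cuda      7.9 k
"""


def parse_packages(log):
    """Return {package name: version}, dropping any epoch prefix and release suffix."""
    packages = {}
    for line in log.splitlines():
        fields = line.split()
        # Assumed layout: step, timestamp, name, arch, [epoch:]version-release, ...
        if len(fields) < 5 or not fields[0].startswith("#"):
            continue
        name, evr = fields[2], fields[4]
        version = evr.split(":")[-1].rsplit("-", 1)[0]
        packages[name] = version
    return packages


if __name__ == "__main__":
    for name, version in parse_packages(LOG).items():
        print(name, version)
```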

So it looks like this is already done

Though it would be nice to update this miniforge-cuda line to 12.5.1, it doesn't seem to be a blocker


Edit: It also looks like the ci-imgs were rebuilt more recently, so we are already using images that have the 12.5.1 packages

Given this, I will go ahead and check these boxes

jakirkham commented 3 weeks ago

cc @KyleFromNVIDIA (as we discussed this offline)

jakirkham commented 2 weeks ago

NVIDIA CUDA 12.5.1 images were released recently. So Kyle and I have updated the RAPIDS images to use them

Added a few notes in the OP. More details in the linked PRs

jameslamb commented 2 weeks ago

After all the libraries have been updated, I think we'll also want to add rapidsai/base and rapidsai/notebooks images, similar to https://github.com/rapidsai/docker/pull/634. Is that tracked anywhere?

I don't know enough about the resource constraints on Docker Hub to say with confidence whether that change should look like just "add CUDA 12.5 images" or should be "stop publishing CUDA 12.2 images and start publishing CUDA 12.5 images". Maybe @raydouglass or @ajschmidt8 could comment on that.

You can see in https://hub.docker.com/r/rapidsai/notebooks/tags that the tagging scheme has -cuda{major}.{minor} in it.
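As a rough illustration of that tagging scheme, the CUDA version could be pulled out of a tag as follows. This is a sketch only: the function name is hypothetical and the example tag is made up to follow the `-cuda{major}.{minor}` pattern, not copied from Docker Hub.

```python
import re

# Matches the "-cuda{major}.{minor}" fragment described above.
CUDA_IN_TAG = re.compile(r"-cuda(\d+)\.(\d+)")


def cuda_version_from_tag(tag):
    """Return (major, minor) if the tag contains -cuda{major}.{minor}, else None."""
    match = CUDA_IN_TAG.search(tag)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    # Hypothetical tag used only to exercise the pattern.
    print(cuda_version_from_tag("24.08-cuda12.5-py3.11"))
```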

jakirkham commented 2 weeks ago

We noticed the .devcontainers contents were updated for CUDA 12.5, but the .devcontainers path names were not. Submitted a few PRs to clean up the ones that were already merged. Also pushed changes to open PRs to fix this. Made a note for us to update our PR generation scripts to address this going forward
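A check for this kind of mismatch might look like the sketch below. Everything specific here is an assumption for illustration: the directory naming (`cuda12.2-conda`), the JSON contents, and the helper names are hypothetical and are not taken from the RAPIDS repositories or the PR generation scripts.

```python
import re

# Finds a "major.minor" version shortly after the word "cuda" (case-insensitive).
CUDA_VERSION = re.compile(r"cuda\D{0,5}?(\d+\.\d+)", re.IGNORECASE)


def cuda_version_in(text):
    """Return the first CUDA version string found in text, or None."""
    match = CUDA_VERSION.search(text)
    return match.group(1) if match else None


def find_mismatches(files):
    """Given {path: contents}, return paths whose name and contents disagree on CUDA version."""
    mismatched = []
    for path, contents in files.items():
        in_path = cuda_version_in(path)
        in_contents = cuda_version_in(contents)
        if in_path and in_contents and in_path != in_contents:
            mismatched.append(path)
    return mismatched


if __name__ == "__main__":
    # Hypothetical file layout and contents.
    example = {
        ".devcontainer/cuda12.2-conda/devcontainer.json": '{"build": {"args": {"CUDA": "12.5"}}}',
        ".devcontainer/cuda12.5-pip/devcontainer.json": '{"build": {"args": {"CUDA": "12.5"}}}',
    }
    print(find_mismatches(example))
```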

jakirkham commented 2 weeks ago

Just an update here, we have gotten the bulk of RAPIDS projects onto CUDA 12.5

One exception is cuSpatial, where we are noticing issues with the notebook tests. These also occur without CUDA 12.5. James has been investigating in https://github.com/rapidsai/cuspatial/pull/1407 and will be following up with the cuSpatial team on next steps

jameslamb commented 1 week ago

I'm proposing switching to CUDA 12.5 images + Python 3.11 in the docs at https://docs.rapids.ai/deployment

https://github.com/rapidsai/deployment/pull/398