rapidsai / build-planning

Tracking for RAPIDS-wide build tasks
https://github.com/rapidsai

Add support for CUDA 12.2 conda packages #8

Closed: vyasr closed this issue 4 months ago

vyasr commented 6 months ago

We would like to start publishing conda packages that support versions of CUDA newer than CUDA 12.0. At the moment, this is blocked on efforts to get the CTK on conda-forge updated to a sufficiently new version. As of this writing, the conda-forge CTK is being updated to 12.1.1. Our plan is to continue the conda-forge update process, and whichever CTK version is available via conda-forge on Jan 8, 2024 is the version we will use for building RAPIDS 24.02 packages.

Assuming that #7 is completed before this, the main tasks will be to:

Step 2 above will likely involve making updates to dependency files in various RAPIDS repos.

This issue will be filled out more and updated once the conda-forge updates are completed and the version finalized.

jameslamb commented 6 months ago

https://github.com/conda-forge/cuda-feedstock/issues/13 is tracking the rollout of CUDA 12.2 to conda-forge.

bdice commented 6 months ago

We will need to resolve a blocking issue: RAPIDS requires gcc 11, which conflicts with cuda-nvcc's requirement of gcc 12. I have filed a PR with a fix.

jakirkham commented 6 months ago

Both of those issues are resolved

jakirkham commented 6 months ago

Should add that the gcc constraint only affected cuda-nvcc (usually used in dev environments)

cuda-nvcc_{{ target_platform }} is what is used by {{ compiler("cuda") }} and did not have this issue
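For context, a minimal sketch of how a recipe requests the CUDA compiler (the requirements block below is illustrative rather than taken from any particular RAPIDS recipe):

```yaml
requirements:
  build:
    # {{ compiler("cuda") }} resolves to cuda-nvcc_{{ target_platform }} on CUDA 12,
    # which is the package that did not carry the gcc 12 constraint mentioned above
    - {{ compiler("cuda") }}
    - {{ compiler("c") }}
    - {{ compiler("cxx") }}
```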

jakirkham commented 6 months ago

We merged pynvjitlink's Conda package builds yesterday ( https://github.com/rapidsai/pynvjitlink/pull/33 )

Have started a PR to release pynvjitlink 0.1.7 ( https://github.com/rapidsai/pynvjitlink/pull/42 ), which will be needed to build and upload the Conda packages

jakirkham commented 6 months ago

Also, James has submitted PRs adding Conda & wheel support to RAPIDS projects

A full listing of the PRs with current status is in this comment ( https://github.com/rapidsai/build-planning/issues/7#issuecomment-1887577440 )

bdice commented 5 months ago

We will need to modify conda recipes to enforce CUDA Minor Version Compatibility (MVC). This stems from a discussion I started here: https://github.com/rapidsai/raft/pull/2092#issuecomment-1900991184

There are three (two?) basic issues.

1. Ignore run-exports from compiler('cuda')

rmm example commit: https://github.com/rapidsai/rmm/pull/1419/commits/ff8ea2d4672069e2a6087ed45c59807caf3ed0b4

The compiler('cuda') package has a strong run-export on cuda-version. We discussed this and agreed it is good, intentional behavior for the cuda-nvcc compiler package, because not all CUDA software obeys the rules for Minor Version Compatibility. RAPIDS does, however, so we need to ignore this run-export: otherwise it will prevent packages built with CUDA 12.2 from being installed with cuda-version=12.0 or similar.

Concretely, this means updating the existing sections that are ignoring run exports from the CUDA 11 compiler package, and adding the CUDA 12 compiler package as shown in the rmm example.
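
As a rough sketch of the recipe change (the linked rmm commit is the authoritative version; the cuda_major variable and exact layout here are illustrative):

```yaml
build:
  ignore_run_exports_from:
    {% if cuda_major == "11" %}
    # existing entry: ignore run-exports from the CUDA 11 compiler package
    - {{ compiler('cuda11') }}
    {% else %}
    # new entry: drop the strong cuda-version run-export from the CUDA 12 compiler,
    # so a package built with CUDA 12.2 can still be installed with cuda-version=12.0
    - {{ compiler('cuda') }}
    {% endif %}
```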

2. Fix host/run dependencies so we do not inherit incorrect run-exports

rmm example commit: https://github.com/rapidsai/rmm/pull/1419/commits/135c25916dd2b56b48c159375aada38f8a099ace

Initially, when we created CUDA 12 conda packages for RAPIDS, we relied on putting -dev packages in the host environment so that their run-exports would add the corresponding non-dev package to the run dependencies. For example, we added cuda-cudart-dev to rmm's host section, and it created a run dependency on cuda-cudart. This was fine for CUDA 12.0, but the strategy is not compatible with CUDA Minor Version Compatibility. If we build with CUDA 12.2, cuda-cudart-dev will run-export cuda-cudart (a subpackage of the same recipe) with a max pin, which means recipes built with CUDA 12.2's cuda-cudart-dev cannot be installed with cuda-version=12.0.

This applies to all -dev libraries, including cuda-cudart-dev and the math libraries. For some of these libraries, we do not need the -dev package in host to build the RAPIDS package at all. That is true for rmm's usage of cuda-cudart-dev in host (the compiler itself also adds cuda-cudart-dev to the build dependencies, but the runtime library is not a strong run-export, so it doesn't cross from build to run). For other cases, like libcuml's usage of math libraries, we will need to add ignore_run_exports_from and list those -dev libraries in libcuml's recipe, to ensure MVC works as intended.
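
A minimal sketch of the resulting recipe pattern, assuming a hypothetical package that links against cuBLAS (the exact package lists and pins come from each repo's recipe and the linked commits):

```yaml
build:
  ignore_run_exports_from:
    # the -dev packages' run-exports would max-pin the runtime libraries
    # to the builder's minor version (e.g. 12.2), breaking MVC
    - cuda-cudart-dev
    - libcublas-dev

requirements:
  host:
    - cuda-version ={{ cuda_version }}
    - cuda-cudart-dev
    - libcublas-dev
  run:
    # declare runtime libraries explicitly instead of inheriting run-exports,
    # and bound cuda-version only at the major version (e.g. >=12,<13)
    - cuda-cudart
    - libcublas
    - {{ pin_compatible('cuda-version', min_pin='x', max_pin='x') }}
```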

3. Force cuda-version==${RAPIDS_CUDA_VERSION%.*} in all conda commands?

This one is more questionable, and may require no action. We saw a problem in cucim's CI where cupy 13 caused cucim to need an explicit cuda-version specification while installing cucim/libcucim. My hope was that adding cuda-version==${RAPIDS_CUDA_VERSION%.*} to the installation commands in CI would cause an error during the solve if either of the problems described in (1) and (2) were encountered. I tried this while playing with rmm, and found that pinning cuda-version didn't actually cause a solve error -- it just forced a fallback to the latest nvidia channel CUDA packages (CI logs). This was attempted before I applied the changes for (1) and (2), so I would have expected it to fail in CI; instead it succeeded, but with the wrong channels/packages:

  Package             Version  Build                       Channel               Size
───────────────────────────────────────────────────────────────────────────────────────
  Install:
───────────────────────────────────────────────────────────────────────────────────────

  + cuda-cudart      12.3.101  0                           nvidia               214kB
  + gtest              1.14.0  h2a328a1_1                  conda-forge          394kB
  + fmt                10.2.1  h2a328a1_0                  conda-forge          190kB
  + gmock              1.14.0  h8af1aa0_1                  conda-forge            7kB
  + spdlog             1.12.0  h6b8df57_2                  conda-forge          183kB
  + librmm        24.04.00a20  cuda12_240130_gd4f8aa23_20  /tmp/cpp_channel       2MB
  + librmm-tests  24.04.00a20  cuda12_240130_gd4f8aa23_20  /tmp/cpp_channel       4MB

This isn't desirable, so I don't think pinning cuda-version will do anything to help us enforce that CI test jobs are using the versions we intend. Therefore, I propose that we act on items (1) and (2), and skip (3) for now.

bdice commented 4 months ago

I think this can be closed, since conda support is complete. Final tasks are being tracked here: https://github.com/rapidsai/build-planning/issues/7#issuecomment-1957227696