vyasr opened this issue 9 months ago
Possibly related: https://github.com/rapidsai/docs/pull/470
What part of this is dependent on RAPIDS supporting CUDA 12.2?
I was able to solve this environment, and got a CUDA 12 build of pytorch from conda-forge (`pytorch 2.1.2 cuda120_py310h327d3bc_301`).
```shell
mamba create -n rapids-23.12 -c rapidsai -c conda-forge -c nvidia rapids=23.12 python=3.10 cuda-version=12.0 pytorch
```
I don't think we can offer official compatibility between RAPIDS / conda-forge and the `pytorch` channel, given that the `pytorch` package from the `pytorch` channel is built against `nvidia` channel CUDA packages. These channel conflicts are unavoidable. An example environment showing the mixture of `nvidia` and `conda-forge` packages can be generated by adding `-c pytorch` before `-c conda-forge`:
```shell
# Uses both nvidia and conda-forge CUDA Toolkit packages. Not supported.
mamba create -n rapids-23.12 -c rapidsai -c pytorch -c conda-forge -c nvidia rapids=23.12 python=3.10 cuda-version=12.0 pytorch
```
Last I tested it, this environment worked, but we can't offer support for a configuration that draws its CUDA packages from a mixed set of channels.
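To make the mixed-channel problem concrete, here is a minimal sketch of a check over an environment's (package, channel) listing. The package names and sample data are invented for illustration; a real check would parse something like `conda list --json` rather than a hard-coded list:

```python
# Hypothetical illustration: flag environments whose CUDA Toolkit packages
# come from more than one channel. The sample data below is made up.
CUDA_PREFIXES = ("cuda-", "libcublas", "libcufft", "libcusolver", "libcusparse")

def cuda_channels(packages):
    """Return the set of channels that CUDA Toolkit packages came from."""
    return {
        channel
        for name, channel in packages
        if name.startswith(CUDA_PREFIXES)
    }

# Fabricated listing resembling the mixed environment discussed above.
env = [
    ("pytorch", "pytorch"),
    ("cuda-cudart", "nvidia"),       # pulled in by the pytorch-channel build
    ("cuda-version", "conda-forge"),
    ("libcublas", "conda-forge"),    # pulled in by RAPIDS
]

mixed = cuda_channels(env)
if len(mixed) > 1:
    print("Unsupported: CUDA packages from mixed channels:", sorted(mixed))
```

In the fabricated environment above, the CUDA runtime comes from `nvidia` while the math libraries come from `conda-forge`, which is exactly the unsupported configuration produced by the second `mamba create` command.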
At some point in the future we are hoping to make the CUDA distributions on the `nvidia` and `conda-forge` channels compatible, but until that point, I don't see any action item here. The install selector works as desired with PyTorch CUDA 12 packages from `conda-forge`.
I agree that this isn't addressable until the `nvidia` and `conda-forge` CTK packages are aligned. We should consider how the selector ought to work once that day comes, though. To @MatthiasKohl's point, the `pytorch` channel is the officially supported medium (by both NVIDIA and PyTorch) for installing the package, so IMHO once the two are aligned we would probably want to encourage installation of PyTorch from the `pytorch` channel, unless and until the `conda-forge` package sees a level of support similar to what NVIDIA is now providing for the CTK on conda-forge.
> The install selector works as desired with PyTorch CUDA 12 packages from `conda-forge`.
It might work as desired, but I don't think it should. I checked today with Cliff and Piotr from DLFW, and both our DLFW teams and upstream PyTorch have found many incompatibility issues with the PyTorch build from `conda-forge`, e.g. the libc version and so on. The problem is that few people install only PyTorch; most rely on many other packages, which are all either pip-wheel based or based on conda's main channel, and use different base packages. IMO, we should not encourage people to use this PyTorch build. If RAPIDS cannot be compatible with upstream PyTorch (from officially supported channels), then we should either work with DLFW to become compatible, or remove that option from the install selector.
Big relevant news here: https://github.com/pytorch/pytorch/issues/138506
There has not been any substantial effort / progress toward becoming compatible with DLFWs since this was last discussed. The fact that PyTorch is deprecating their conda channel means that there will not be any officially supported PyTorch package on conda, just like for TensorFlow. Thus, we should remove both the PyTorch and TensorFlow options from the install selector.
RAPIDS is often used in conjunction with PyTorch and TensorFlow. Wouldn't it instead make sense to support the conda-forge feedstocks, since they are community driven and pull requests can be made against them? The compatibility changes being discussed here can be made moving forward, now that the conda-forge channel is the way PyTorch will be distributed on conda.
This does make sense, but it definitely requires support from Cliff Woolley and org, so I'd recommend reaching out to them to see what they can support. This will likely take a long time, especially if we want to support conda-forge officially, so while that effort is going on, I'd still recommend removing the selector.
**Is your feature request related to a problem? Please describe.** Once RAPIDS adds support for CUDA 12.2, it will be possible to install conda packages of PyTorch along with RAPIDS from conda. Currently this is not possible because PyTorch supports 12.1 and will likely bump straight to 12.3 for their next set of packages. Since the CUDA 12 lineup of RAPIDS packages is going to leverage CEC (CUDA Enhanced Compatibility) to support arbitrary CUDA minor versions, we will no longer need users to have a specific one for RAPIDS, but dependencies like PyTorch will likely continue to do so.
**Describe the solution you'd like** We should update the release selector to include a range of CUDA minor versions and have it automatically select supported versions based on the user's choice of packages to include in their environment.
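One possible shape for that selector logic, sketched as a minimal illustration. The package names and supported-version sets below are invented for the example, not real compatibility data:

```python
# Minimal sketch of selector logic: given the CUDA minor versions each
# selected package supports, offer only versions supported by all of them.
# The version data here is invented for illustration.
SUPPORTED_CUDA = {
    "rapids": {"12.0", "12.1", "12.2", "12.3"},  # CEC: any 12.x minor
    "pytorch": {"12.1"},                          # pinned to one minor
}

def selectable_cuda_versions(selected_packages):
    """Intersect the supported CUDA versions of all selected packages."""
    versions = None
    for pkg in selected_packages:
        supported = SUPPORTED_CUDA[pkg]
        versions = supported if versions is None else versions & supported
    return sorted(versions or ())

print(selectable_cuda_versions(["rapids"]))             # all 12.x minors
print(selectable_cuda_versions(["rapids", "pytorch"]))  # only the overlap
```

The design choice here is that RAPIDS contributes a wide range (thanks to CEC) while a pinned dependency like PyTorch narrows the intersection, so the selector would automatically grey out the CUDA versions that the combined environment cannot satisfy.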
**Additional context** For libraries like PyTorch, we will also need to consider which channel the package will be installed from. Officially supported PyTorch builds come from the `pytorch` channel, not `conda-forge`, so unless/until that changes we will need to ensure that our install command accounts for that correctly.