rapidsai / deployment

RAPIDS Deployment Documentation
https://docs.rapids.ai/deployment/stable/

Upgrade to CUDA `12.2` in Databricks ML Runtimes #299

Closed skirui-source closed 7 months ago

skirui-source commented 9 months ago

CC: @jacobtomlinson

jacobtomlinson commented 9 months ago

I think generally we want to figure out the most ergonomic route to CUDA 12 on Databricks.

Right now, when you spin up a node on Databricks, the NVIDIA driver reports CUDA 12.2. However, the Databricks container we instruct folks to use ships CUDA Toolkit 11.8, so you need to install `cudf-cu11`.
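
The pairing described here (CUDA Toolkit 11.8 in the container, hence `cudf-cu11`) can be sketched as a small lookup from the toolkit's major version to the wheel suffix. This is only an illustration: the function name `rapids_wheel_suffix` is hypothetical and not part of any RAPIDS or Databricks tooling, and it assumes the toolkit version is already known as a `major.minor` string.

```shell
#!/bin/bash
# Hypothetical helper: map a CUDA Toolkit version string (e.g. "11.8")
# to the matching RAPIDS wheel suffix (e.g. "cu11").
rapids_wheel_suffix() {
    local toolkit_version="$1"
    local major="${toolkit_version%%.*}"   # keep everything before the first dot
    case "$major" in
        11|12) echo "cu${major}" ;;
        *)     echo "unsupported CUDA Toolkit version: ${toolkit_version}" >&2
               return 1 ;;
    esac
}

# Example: the container described above ships CUDA Toolkit 11.8.
echo "cudf-$(rapids_wheel_suffix 11.8)"   # prints "cudf-cu11"
```

The suffix follows the toolkit version, not the CUDA version reported by the driver, so a 12.2 driver alone does not change the result.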

It is possible to upgrade CUDA Toolkit either via an init script or at runtime, but it adds some complexity.

It would also be interesting to explore avoiding the containers altogether and using the ML Runtime directly, to see what CUDA Toolkit and other versions you get with that. I tried this yesterday but couldn't get to a working RAPIDS setup.

skirui-source commented 7 months ago

@jacobtomlinson I have tested the RAPIDS installation on a multi-node Databricks cluster with this init script:

#!/bin/bash
set -e

# Install RAPIDS libraries
pip install \
    --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple \
    "cudf-cu12>=24.02.0a0,<=24.02" "dask-cudf-cu12>=24.02.0a0,<=24.02" \
    "cuml-cu12>=24.02.0a0,<=24.02" "dask-cuda>=24.02.0a0,<=24.02"

This worked on the following Databricks ML Runtimes (installed driver version 535.54.03, CUDA version 12.2):

- 13.3 LTS ML (GPU, Scala 2.12, Spark 3.4.1)
- 14.0, 14.1, 14.2 LTS ML (GPU, Scala 2.12, Spark 3.5.0)

Failed:

- 12.2 LTS ML (GPU, Scala 2.12, Spark 3.3.2)
- 14.3 LTS ML Beta (GPU, Scala 2.12, Spark 3.5.0)

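
The results above could be encoded as a guard near the top of the init script, so that a cluster on an untested runtime gets a clear warning. This is only a sketch: the function name `is_tested_runtime` is hypothetical, and it assumes the cluster exposes its runtime version in the `DATABRICKS_RUNTIME_VERSION` environment variable in `major.minor` form, which should be verified on the target workspace.

```shell
#!/bin/bash
# Hypothetical guard: succeed only for runtime versions reported working in this thread.
is_tested_runtime() {
    case "$1" in
        13.3|14.0|14.1|14.2) return 0 ;;
        *)                   return 1 ;;
    esac
}

# Assumption: Databricks sets DATABRICKS_RUNTIME_VERSION (e.g. "13.3").
runtime="${DATABRICKS_RUNTIME_VERSION:-unknown}"
if is_tested_runtime "$runtime"; then
    echo "Runtime ${runtime}: reported working with the cu12 wheels"
else
    echo "Runtime ${runtime}: not in the tested list, RAPIDS install may fail" >&2
fi
```

The guard only warns rather than exiting, so it does not block cluster startup on runtimes that were simply not covered by the test.
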
skirui-source commented 7 months ago

CC @jacobtomlinson

**Notes:**

NEXT STEPS: