skirui-source closed this issue 7 months ago
I think generally we want to figure out the most ergonomic route to CUDA 12 on Databricks.
Right now when you spin up a node on Databricks you get CUDA 12.2 with the NVIDIA drivers. However, the Databricks container we are instructing folks to use has CUDA Toolkit 11.8, so you need to install `cudf-cu11`.
It is possible to upgrade CUDA Toolkit either via an init script or at runtime, but it adds some complexity.
It would also be interesting to explore avoiding using the containers altogether and using the ML Runtime and see what CUDA Toolkit and other versions you get with that. But I tried doing this yesterday and couldn't get to a working RAPIDS setup.
@jacobtomlinson I have tested installing RAPIDS on a multi-node Databricks cluster with this init script:
#!/bin/bash
set -e
# Install RAPIDS libraries
pip install \
    --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple \
    "cudf-cu12>=24.02.0a0,<=24.02" "dask-cudf-cu12>=24.02.0a0,<=24.02" \
    "cuml-cu12>=24.02.0a0,<=24.02" "dask-cuda>=24.02.0a0,<=24.02"
The install succeeded on the following Databricks ML Runtimes (installed driver version 535.54.03, CUDA version 12.2):
- 13.3 LTS ML (GPU, Scala 2.12, Spark 3.4.1)
- 14.0, 14.1, 14.2 LTS ML (GPU, Scala 2.12, Spark 3.5.0)
Failed:
- 12.2 LTS ML (GPU, Scala 2.12, Spark 3.3.2)
- 14.3 LTS ML Beta (GPU, Scala 2.12, Spark 3.5.0)
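One rough way to tell whether the init script left a usable install on a given runtime is to try importing each library after the cluster starts. The sketch below is only an illustration, not something run as part of the tests above: `check_rapids_imports` and the `PYTHON_BIN` override are hypothetical names, and the module names assume the usual import names for the wheels installed by the script (`cudf`, `dask_cudf`, `cuml`, `dask_cuda`).

```shell
#!/bin/bash
# Hypothetical helper: try importing each module and report the ones that fail.
# PYTHON_BIN is an assumed override so the check can target a specific interpreter.
PYTHON_BIN="${PYTHON_BIN:-python}"

check_rapids_imports() {
    local failed=0
    local mod
    for mod in "$@"; do
        if "$PYTHON_BIN" -c "import ${mod}" >/dev/null 2>&1; then
            echo "ok: ${mod}"
        else
            echo "FAILED: ${mod}"
            failed=1
        fi
    done
    # Non-zero exit status if any import failed.
    return "$failed"
}

# Example call on a cluster node:
#   check_rapids_imports cudf dask_cudf cuml dask_cuda
```

The function only reports whether each import succeeds. It does not say why a runtime fails, so the pip and import error logs are still needed for the failing runtimes.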
CC @jacobtomlinson
**Notes:**
- I tried to install CUDA 12.2 directly via an init script based on the NVIDIA instructions, but ran into issues first with fetching the keyring and then with the pynvml library. I have googled for workarounds but found nothing so far! 😿
- When you launch a node with the 14.2 ML Runtime, `nvidia-smi` shows driver version 535.54.03 and CUDA version 12.2, but the docs and `nvcc --version` show that CUDA Toolkit 11.8 is installed.
- I considered building a custom image (from an nvidia/cuda:runtime-12.2.2 base) and adding Databricks-specific settings (pyspark installation etc.). But Jacob thinks that building custom Docker images would add unnecessary complexity to the instructions for users.
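The mismatch between `nvidia-smi` and `nvcc --version` noted above can be checked from a notebook shell cell or an init script. This is a minimal sketch, assuming the `nvidia-smi` header contains `CUDA Version: X.Y` and `nvcc --version` prints a line containing `release X.Y`. The function names are made up for illustration.

```shell
#!/bin/bash
# Read nvidia-smi output on stdin and print the CUDA version the driver supports,
# taken from the header text "CUDA Version: 12.2".
driver_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Read `nvcc --version` output on stdin and print the installed toolkit version,
# taken from the line "Cuda compilation tools, release 11.8, V11.8.89".
toolkit_cuda_version() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print both versions side by side; needs nvidia-smi and nvcc on the PATH.
report_cuda_versions() {
    local driver toolkit
    driver="$(nvidia-smi | driver_cuda_version)"
    toolkit="$(nvcc --version | toolkit_cuda_version)"
    echo "driver supports CUDA ${driver}, toolkit is CUDA ${toolkit}"
}
```

On a node like the one described, `report_cuda_versions` would be expected to show the two versions disagreeing. The version in the `nvidia-smi` header is the highest CUDA version the driver supports, while `nvcc` reports the toolkit that is actually installed.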
NEXT STEPS:
[ ] In the deployment docs, stick to installing the RAPIDS version compatible with CUDA Toolkit 11.8 until Databricks upgrades to CUDA 12.0+
[ ] For now, focus on the PR that simplifies RAPIDS installation on the Databricks ML Runtime without Docker runtime containers
[ ] Move this issue to the backlog until further notice
CC: @jacobtomlinson