rapidsai / deployment

RAPIDS Deployment Documentation
https://docs.rapids.ai/deployment/stable/

Update Databricks multi-node to CUDA 12 #410

Open jacobtomlinson opened 1 month ago

jacobtomlinson commented 1 month ago

Our current docs for multi-node Databricks cover the following process:

The container images use CUDA 11.8 and there are no CUDA 12 images available from Databricks.

The single-node instructions don't use a custom container at all, so in theory we should be able to do the same with the multi-node instructions.

In practice, if you omit the custom container, the init script fails. The logs show that NVML can't be found during Dask startup. This makes me think that either the NVIDIA driver or the CUDA toolkit is not installed when the init script runs, and is only installed later.
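One way to test that hypothesis is to probe for the NVML shared library at the point where the init script runs. The following is only a rough diagnostic sketch, not something taken from dask-databricks or the Databricks docs. It assumes a Linux node where NVML ships as `libnvidia-ml.so.1` (the usual name installed by the NVIDIA driver). The helper names `nvml_loadable` and `wait_for_nvml` are made up for illustration. Being able to load the library only shows that the driver's userspace files are present. It does not prove that NVML initialisation will succeed.

```python
import ctypes
import time


def nvml_loadable(lib_name: str = "libnvidia-ml.so.1") -> bool:
    """Return True if the dynamic linker can load the NVML shared library.

    Assumption: on Linux the NVIDIA driver installs NVML as libnvidia-ml.so.1.
    """
    try:
        ctypes.CDLL(lib_name)
    except OSError:
        # Library not found (or not loadable) at this point in startup.
        return False
    return True


def wait_for_nvml(timeout_s: float = 300.0, interval_s: float = 5.0,
                  probe=nvml_loadable) -> bool:
    """Poll probe() until it returns True or timeout_s seconds have elapsed.

    Hypothetical helper: the timeout and interval values are arbitrary.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Log the state seen by the init script before Dask is started.
    print("NVML loadable:", nvml_loadable())
```

If `nvml_loadable()` returns `False` inside the init script but `True` in a notebook once the cluster is up, that would be consistent with the driver being installed after init scripts run. Whether polling with something like `wait_for_nvml` is a workable fix depends on whether Databricks installs the driver while the init script is still running, which this issue has not established.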

We should find a way to start up dask-databricks without using a custom container and update the documentation.