rapidsai / deployment

RAPIDS Deployment Documentation
https://docs.rapids.ai/deployment/stable/

Update docs to install RAPIDS in ML Databricks Runtimes (via init scripts) #324

Closed skirui-source closed 7 months ago

skirui-source commented 7 months ago

Fixes: https://github.com/rapidsai/deployment/issues/299

Right now when you spin up a node on Databricks you get CUDA 12.2 with the NVIDIA drivers. However, the Databricks container we are instructing users to use ships CUDA Toolkit 11.8, so you need to install cudf-cu11.
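The key point is that the wheel suffix has to match the CUDA Toolkit inside the runtime, not the CUDA version reported by the driver. A tiny hypothetical helper (the function name is illustrative, not part of this PR) makes the rule explicit:

```python
# Hypothetical helper (not part of the PR): pick the RAPIDS wheel suffix
# that matches the CUDA Toolkit shipped in the runtime, not the CUDA
# version reported by the node's driver.
def rapids_suffix(toolkit_version: str) -> str:
    major = int(toolkit_version.split(".")[0])
    if major == 11:
        return "cu11"
    if major == 12:
        return "cu12"
    raise ValueError(f"no RAPIDS wheels for CUDA Toolkit {toolkit_version}")

# Databricks ML Runtimes ship CUDA Toolkit 11.8, so even though the driver
# reports CUDA 12.2, the cu11 packages are the correct ones to install.
print(f"cudf-{rapids_suffix('11.8')}")  # cudf-cu11
```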

After testing, I was able to successfully pip install RAPIDS via an init script using the ML Runtimes. This PR updates the docs to avoid using the TensorFlow/PyTorch Docker runtime containers altogether.
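As a rough sketch, a cluster-scoped init script along these lines can do the install. The pip path and package list here are assumptions for illustration, not the exact script added by this PR:

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical init script sketch: install RAPIDS (cu11 wheels) into the
# Databricks ML Runtime's Python environment before Spark starts.
PIP=/databricks/python/bin/pip
PKGS=(cudf-cu11 dask-cudf-cu11 dask-databricks)

if [ -x "$PIP" ]; then
    # On a Databricks node: install against NVIDIA's package index.
    "$PIP" install --extra-index-url=https://pypi.nvidia.com "${PKGS[@]}"
else
    # Outside Databricks (e.g. checking the script locally): just show
    # what would run instead of failing on the missing pip path.
    echo "would run: pip install --extra-index-url=https://pypi.nvidia.com ${PKGS[*]}"
fi
```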

This worked successfully for the following Databricks ML Runtimes (installed Driver Version: 535.54.03 and CUDA Version: 12.2):

- 13.3 LTS ML (GPU, Scala 2.12, Spark 3.4.1)
- 14.0, 14.1, 14.2 ML (GPU, Scala 2.12, Spark 3.5.0)

Failed:

- 12.2 LTS ML (GPU, Scala 2.12, Spark 3.3.2)
- 14.3 LTS ML Beta (GPU, Scala 2.12, Spark 3.5.0)
skirui-source commented 7 months ago

Update:

For both single- and multi-node clusters, I am able to successfully pip install (and import) RAPIDS and other libraries via init script with the 14.2 LTS ML Runtime (GPU, Scala 2.12, Spark 3.5.0).

However, with the same ML Runtime, I experience issues launching a multi-node Dask cluster with dask-databricks, i.e.:

skirui-source commented 7 months ago

@jacobtomlinson, did you want the init scripts stored in a public S3 bucket (I filed an ops ticket), or should we leave them as is?

Also, this is ready for review.

jacobtomlinson commented 7 months ago

Let's open a separate ticket for putting a copy in S3 to simplify things (would you mind opening that?). Then we can get this merged.

skirui-source commented 7 months ago

I have already filed an issue with the ops team to request a copy in an S3 bucket. I think we can go ahead and merge this unless you have any additional feedback?