rapidsai / deployment

RAPIDS Deployment Documentation
https://docs.rapids.ai/deployment/stable/

RAPIDS/Dask Databricks MNMG #296

Closed. jacobtomlinson closed this issue 3 months ago.

jacobtomlinson commented 10 months ago

This is a high-level tracking issue for supporting Dask RAPIDS deployments on Databricks.

High-level goals

Databricks multi-node technical architecture

When you launch a multi-node cluster on Databricks, a Spark driver node and many Spark worker nodes are provisioned. When you run a notebook session, the notebook kernel executes on the driver node and the Spark cluster can be leveraged via pyspark.
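For example, driver-side Spark usage in a notebook looks roughly like the following sketch, assuming the standard Databricks environment where a Spark session is already active on the driver (the data path is hypothetical, for illustration only):

```python
from pyspark.sql import SparkSession

# On Databricks the notebook kernel runs on the driver node, where a Spark
# session already exists; getOrCreate() simply returns it.
spark = SparkSession.builder.getOrCreate()

# Hypothetical data path, for illustration only.
df = spark.read.parquet("dbfs:/tmp/example.parquet")
print(df.count())
```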

To use Spark RAPIDS with this Databricks cluster, the user must provide an init script that downloads the rapids-4-spark-xxxx.jar plugin, and then configure Spark to load that plugin. Spark queries will then leverage libcudf under the hood and benefit from GPU acceleration.
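The Spark side of that configuration amounts to a couple of settings on the cluster, sketched below (these are the standard Spark/RAPIDS Accelerator plugin keys; the jar version is elided in this issue, so no path is shown):

```python
# A hedged sketch of the Spark config that loads the RAPIDS Accelerator,
# expressed as the key/value pairs that would be set on the cluster.
spark_rapids_conf = {
    # Tell Spark to load the RAPIDS Accelerator plugin from the downloaded jar
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    # Enable GPU-accelerated SQL (backed by libcudf)
    "spark.rapids.sql.enabled": "true",
}
```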

Adding Dask to Databricks clusters

This driver-and-workers cluster architecture mirrors a Dask cluster, which has a scheduler and workers on different nodes. The Spark RAPIDS example also shows that Databricks provides a mechanism to run a script on every node at startup, for example to install plugins.

Because a script runs on every node at startup, we could use this mechanism to create a Dask Runner (xref https://github.com/dask/community/issues/346) that bootstraps a Dask cluster as part of the init script process.

When the init script runs, it appears to have access to a number of environment variables. The most useful of these are DB_IS_DRIVER and DB_DRIVER_IP. These variables provide all the information we need to ensure that the driver node starts a dask scheduler process and the worker nodes each start a dask cuda worker process that connects to the scheduler. These processes would need to be started in the background so that the init script can complete and move on with the rest of the Spark cluster startup.
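A minimal sketch of that init-script logic, assuming the DB_IS_DRIVER and DB_DRIVER_IP variables described above, the default Dask scheduler port (8786), and that dask-cuda is installed on every node (the exact CLI entry points may differ between versions):

```python
import os
import subprocess

# Databricks init scripts appear to expose these variables (see above).
is_driver = os.environ.get("DB_IS_DRIVER", "FALSE").upper() == "TRUE"
driver_ip = os.environ.get("DB_DRIVER_IP", "127.0.0.1")

if is_driver:
    # Driver node: start the Dask scheduler.
    cmd = ["dask", "scheduler"]
else:
    # Worker node: start a dask-cuda worker pointed at the driver's scheduler,
    # assuming the default scheduler port 8786.
    cmd = ["dask", "cuda", "worker", f"tcp://{driver_ip}:8786"]

# Launch in the background so the init script can exit and the rest of the
# Spark cluster startup can continue.
subprocess.Popen(cmd)
```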

Then, when using a notebook whose kernel runs on the driver node, we should be able to use dask_cudf and other RAPIDS Dask integrations to communicate with the Dask scheduler on localhost, much as pyspark does.
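Notebook usage could then look something like this sketch (the scheduler address assumes the default port, and the data path is hypothetical):

```python
from dask.distributed import Client
import dask_cudf

# The scheduler runs on this same driver node, so connect over localhost.
client = Client("localhost:8786")

# Hypothetical data path (via the /dbfs FUSE mount), for illustration only.
ddf = dask_cudf.read_parquet("/dbfs/tmp/example.parquet")
print(ddf.head())
```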

Deliverables

### Tasks
- [x] Create an init script that prints out which node is the scheduler and which nodes are the workers
- [x] Create an init script (ideally in Python) that starts a Dask cluster as background processes
- [x] Package that script as a Python library/CLI for easy distribution
- [ ] https://github.com/rapidsai/deployment/issues/228
- [ ] https://github.com/rapidsai/deployment/issues/298
- [ ] https://github.com/rapidsai/deployment/issues/312

Other info

@jacobtomlinson has started experimenting with the Dask Runners concept in https://github.com/jacobtomlinson/dask-runners (which will likely move to the dask-contrib org at some point). That repo is probably a reasonable place to prototype this in a PR.