rstudio / helm

Helm Resources for RStudio Products

general guidance on GPU setup #299

Open airpaio opened 1 year ago

airpaio commented 1 year ago

Does anyone have experience running Launcher sessions with GPUs? I know we can set `default-nvidia-gpus` in the Launcher profiles (`server.profiles.launcher.kubernetes.profiles.conf`), but what else is required? In a different project we would configure GPU jobs with tolerations similar to what is shown here: https://github.com/NVIDIA/k8s-device-plugin#running-gpu-jobs. I'm just wondering how that might translate into the Launcher profiles config for Workbench/Connect.
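For reference, this is roughly what I have so far in the profiles file itself - a minimal sketch, assuming the standard ini layout from the Launcher docs (values are illustrative, and the Helm values path that renders this file may differ by chart version):

```ini
# launcher.kubernetes.profiles.conf (sketch) - the [*] section applies to all users
[*]
default-nvidia-gpus=0
max-nvidia-gpus=1
```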

airpaio commented 1 year ago

I came across #271, which seems related. The first comment there mentions using templates. I'm still trying to work through this, but maybe these docs will help: https://docs.posit.co/job-launcher/kube.html#kube-templating
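If templating is the route, my reading is that you copy the chart's job.tpl and splice a tolerations block into the pod spec, mirroring the NVIDIA example linked above - a sketch (untested) of what that block would look like:

```yaml
# Toleration from the NVIDIA k8s-device-plugin example, added to the pod spec
# of a copied job.tpl so GPU jobs can schedule onto tainted GPU nodes.
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```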

colearendt commented 1 year ago

Whoops - apologies for the delay in getting to this; I missed it.

Is the context here more for Workbench or more for Connect?

I have used this functionality with Workbench. When `default-nvidia-gpus`, `max-nvidia-gpus`, and friends are enabled in Workbench, we display a selector that allows permitted users to decide whether their session uses GPUs, and how many (presuming any nodes have more than one GPU available). The selection ultimately finds its way into the job as resources:

https://github.com/rstudio/helm/blob/9d58e030b0eafac5a4834a131685d62dd1c9d0c5/examples/launcher-templates/default/2.1.0/job.tpl#L181-L190

https://docs.posit.co/job-launcher/kube.html#kube-profiles
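Concretely, when a user selects one GPU, the container spec that the template above renders ends up with roughly the following (assuming the NVIDIA device plugin is the one advertising the `nvidia.com/gpu` resource):

```yaml
# Rough shape of the rendered resources block for a session requesting 1 GPU
resources:
  limits:
    nvidia.com/gpu: "1"
```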

In my testing, this was sufficient to get a GPU job scheduled properly. If that is not the case for you, we would love to learn more about what is going wrong! I understood the tolerations shown in your example to be "overkill" in some sense (i.e. in a cluster with many different types of GPUs, they make sure the job lands on this particular one). Using templating or job-json-overrides today (which you reference) is unfortunately not a fantastic answer, as it would require all jobs for a given user to use those tolerations.
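For completeness: with job-json-overrides you would point a JSON path in the job spec at a file containing the tolerations (see the kube docs linked above for the exact option syntax), and that file would look something like the sketch below - but again, it would then apply to every job for the affected users:

```json
[
  {
    "key": "nvidia.com/gpu",
    "operator": "Exists",
    "effect": "NoSchedule"
  }
]
```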

Connect has not exposed any of this functionality to date. It is possible to run all Connect jobs with GPUs (i.e. using the default), but not to select which jobs run with GPUs and which do not. If this is coming up as an important piece of functionality, we would love to learn more about why, so we can help prioritize the necessary work!