rstudio / helm

Helm Resources for RStudio Products
MIT License

Workbench: Tolerations for specific Pods (GPU) #447

Open kenchrcum opened 11 months ago

kenchrcum commented 11 months ago

Hi everybody, we are currently trying to deploy Workbench on our Kubernetes cluster via Helm. Everything works fine, but we have some hardware GPU nodes that should be reserved for Workbench GPU sessions. We have no problems starting the GPU sessions, but we can't get the nodes "reserved" for these sessions. We are trying to do this by tainting the nodes, but we can't get the toleration applied exclusively to the GPU sessions. After reading through the chart and other repo issues, it seems that it is only possible to set tolerations for all sessions of a Workbench server. We hoped placement-constraints would help us solve this, but it isn't working as expected, since it looks at the labels of a node rather than its taints. Is there any chance to make this work? Are we just missing some documentation, or is this totally out of scope?

Thanks in advance for any help or suggestion :)

iamsarat commented 10 months ago

+1

iamsarat commented 10 months ago

We need a way to use GPU nodes exclusively for sessions that request GPU resources, and the current configuration doesn't support this.

colearendt commented 10 months ago

Thanks for reporting this! I think you are right that this is less than ideal. If you are trying to set a toleration exclusively on a GPU session, that is something that may be possible by customizing templates. Customizing templates is generally a pretty advanced feature (and can definitely be tedious / annoying across chart versions), but it should be able to get you going here!

Can you share an example of a toleration as you would expect it to be defined on the pod that is launched? I should be able to mock up some helm values that can work with that input!

kenchrcum commented 9 months ago

Sorry for the long delay and thank you for your reply.

One taint we would set on the GPU node is, for example, nvidia-gpu=server:NoSchedule, and we would need to set the corresponding toleration on GPU Workbench sessions.
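
For reference, a taint of this form would correspond to a toleration like the one below on the launched session pod. This is only a minimal sketch of the standard Kubernetes pod spec fields. It does not show the chart's values layout or the template customization, and `<node-name>` is a placeholder.

```yaml
# Taint applied to the GPU node (placeholder node name):
#   kubectl taint nodes <node-name> nvidia-gpu=server:NoSchedule
#
# Matching toleration, as it would appear under the pod's `spec`:
tolerations:
  - key: "nvidia-gpu"      # taint key
    operator: "Equal"      # key and value must both match the taint
    value: "server"        # taint value
    effect: "NoSchedule"   # taint effect being tolerated
```

Note that a toleration only allows a pod onto the tainted node. It does not steer the pod there, so the GPU sessions would still rely on their GPU resource request (or a node selector) to land on the GPU nodes.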