Open dcarrion87 opened 2 years ago
Figured this out. Noticed all the Ray images are baking in NVIDIA_VISIBLE_DEVICES=all.
The nvidia-container-runtime config has accept-nvidia-visible-devices-envvar-when-unprivileged = true set,
so the runtime is just assigning every GPU on the host into the container.
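For context, a minimal sketch of the relevant part of /etc/nvidia-container-runtime/config.toml (excerpt only; exact contents vary by toolkit version):
# /etc/nvidia-container-runtime/config.toml (excerpt, sketch only; other keys omitted)
# With this enabled, an unprivileged container's NVIDIA_VISIBLE_DEVICES envvar is honoured,
# so an image that bakes in NVIDIA_VISIBLE_DEVICES=all gets every GPU on the host.
accept-nvidia-visible-devices-envvar-when-unprivileged = true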
Thanks for reporting this issue @dcarrion87! Glad you've resolved it. Would it help if we made this more clear in the docs? cc @jjyao @DmitriGekhtman
I think so...
Looks like this is an issue in multiple projects. Noticed it with Kubeflow. What an absolute pain. Volume mounts may be the answer? https://docs.google.com/document/d/1uXVF-NWZQXgP1MLb87_kMkQvidpnkNWicdpO2l9g-fw/edit#
Regardless of nvidia.com/gpu limits, the head and worker nodes are coming up with all 8 GPU devices every time. E.g. I limit a worker node to 2 and it comes up with 8; I specify none for the head node and it comes up with 8.
Could you explain what you mean by "comes up with 8 GPUs"? Which system registers the presence of the GPUs? What's the precise undesirable behavior?
@DmitriGekhtman in a Kubernetes environment where we want GPU limits to be controlled by the NVIDIA GPU operator, having NVIDIA_VISIBLE_DEVICES=all as an envvar in the images means that all GPUs on the host come up instead of the controlled amount specified via the "nvidia.com/gpu" limit. This document describes the undesirable effect and the spec changes that minimise it by using volume mounts: https://docs.google.com/document/d/1uXVF-NWZQXgP1MLb87_kMkQvidpnkNWicdpO2l9g-fw/edit#
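To illustrate the failure mode, a hypothetical worker container spec (not taken from the report; image name is a placeholder): the device plugin allocates 2 GPUs, but the baked-in envvar makes the runtime expose all 8.
# Hypothetical container spec for illustration only
containers:
  - name: ray-worker
    image: example/ray-gpu:latest   # image bakes in NVIDIA_VISIBLE_DEVICES=all
    resources:
      limits:
        nvidia.com/gpu: "2"         # device plugin allocates 2 GPUs...
# ...but with accept-nvidia-visible-devices-envvar-when-unprivileged = true,
# the runtime honours the envvar and exposes all 8 GPUs on the host.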
Thanks, that makes sense.
cc @kevin85421
This is most likely an issue that should be solved in the KubeRay repo, but it's fine to track it here. I'd consider this a blocker for the KubeRay 0.4.0 release.
@DmitriGekhtman @kevin85421
We were able to solve this, in case you want to document the approach anywhere. It relies heavily on a locked-down Kubernetes workspace environment that we provide to users, where all they can change is the image.
- Update /etc/nvidia-container-runtime/config.toml to support both volume- and envvar-based assignment:
  accept-nvidia-visible-devices-envvar-when-unprivileged = true
  accept-nvidia-visible-devices-as-volume-mounts = true
  You could set the envvar option to false, but we are leveraging the code in the toolkit that checks volume mounts first and bails out if it finds any. We still need unprivileged envvar capabilities for another use case where we want to share the same GPU between 2 containers in the same pod and need to inject the GPU ID from the GPU-hog container into the other unprivileged container (see the pod sketch at the end of this comment).
- Update the NVIDIA GPU operator to use volume-based device listing:
  apiVersion: helm.cattle.io/v1
  kind: HelmChart
  metadata:
    name: gpu-operator
    namespace: kube-system
  spec:
    # Set explicitly - NVIDIA breaks CRDs between versions
    chart: https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-v1.11.1.tgz
    targetNamespace: gpu-operator
    set:
      driver.enabled: "false"
      toolkit.enabled: "false"
      devicePlugin.env[0].name: "DEVICE_LIST_STRATEGY"
      devicePlugin.env[0].value: "volume-mounts"
- Change the default kuberay cluster chart values to override head and worker variables so they can never get all GPUs via the envvar method: NVIDIA_VISIBLE_DEVICES=void (see the sketch right after this list).
- Rely on the volume mount strategy to assign the correct GPUs into the unprivileged containers via the nvidia.com/gpu limit.
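For reference, a hedged sketch of what that override could look like, assuming the kuberay ray-cluster chart exposes containerEnv lists for head and worker (field names may differ between chart versions):
# values.yaml override sketch; containerEnv is an assumed field name, check your chart version
head:
  containerEnv:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "void"   # never falls back to "all GPUs" via the envvar path
worker:
  containerEnv:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "void"   # actual GPUs arrive via volume mounts from the nvidia.com/gpu limit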
Looks like NVIDIA is working on something for upcoming Kubernetes versions that is going to stop this mayhem in shared environments: https://gitlab.com/nvidia/cloud-native/k8s-dra-driver
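The shared-GPU use case mentioned above, as a hypothetical pod sketch (names and the injected value are placeholders, not from the thread): the GPU is allocated to the hog container via nvidia.com/gpu, and its ID is injected into the unprivileged sidecar via NVIDIA_VISIBLE_DEVICES, which is why the envvar option stays enabled.
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod                        # hypothetical name
spec:
  containers:
    - name: gpu-hog
      image: example/gpu-workload:latest      # placeholder image
      resources:
        limits:
          nvidia.com/gpu: "1"                 # GPU allocated by the device plugin
    - name: sidecar
      image: example/gpu-sidecar:latest       # placeholder image
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "GPU-<uuid-of-allocated-gpu>"  # injected GPU ID (placeholder value)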
Thank @dcarrion87 for this information! I will take a look at this today.
This appears to be an NVIDIA problem. Decreasing priority for now.
What happened + What you expected to happen
Regardless of nvidia.com/gpu limits, the head and worker nodes are coming up with all 8 GPU devices every time. E.g. I limit a worker node to 2 and it comes up with 8; I specify none for the head node and it comes up with 8.
I'm using AUTOSCALER_CONSERVE_GPU_NODES=0 to get around allocation as described here: https://github.com/ray-project/ray/issues/29658
Interestingly enough, if NVIDIA_VISIBLE_DEVICES=void is set on the head node it correctly avoids assigning GPUs. I wonder if the NVIDIA GPU operator is being bypassed?
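For completeness, a hedged sketch of how those two variables could be set on the head pod in the RayCluster spec (paths abbreviated; exact structure depends on the KubeRay version):
# RayCluster head group excerpt (sketch only; assumed field layout, check your KubeRay version)
headGroupSpec:
  template:
    spec:
      containers:
        - name: ray-head
          env:
            - name: AUTOSCALER_CONSERVE_GPU_NODES
              value: "0"      # workaround described in ray-project/ray#29658
            - name: NVIDIA_VISIBLE_DEVICES
              value: "void"   # stops the envvar path from exposing all GPUs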