ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray Cluster] Assigning all host GPUs into head node without nvidia.com/gpu present #29753

Open dcarrion87 opened 2 years ago

dcarrion87 commented 2 years ago

What happened + What you expected to happen

Interestingly enough, if NVIDIA_VISIBLE_DEVICES=void is set on the head node, it correctly avoids assigning GPUs. I wonder if the NVIDIA GPU operator is being bypassed?

dcarrion87 commented 2 years ago

Figured this out. All the Ray images bake in NVIDIA_VISIBLE_DEVICES=all.

The nvidia-container-runtime config has accept-nvidia-visible-devices-envvar-when-unprivileged = true set.

So the runtime just assigns every GPU on the host into the container.
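
For illustration, a minimal sketch of overriding the baked-in variable on the head pod so the envvar path can never hand it any GPUs. This assumes a KubeRay RayCluster manifest; the names, image tag, and resource values are placeholders, not taken from this issue:

  apiVersion: ray.io/v1alpha1
  kind: RayCluster
  metadata:
    name: example-cluster              # hypothetical name
  spec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.0.0   # placeholder tag; the image bakes in NVIDIA_VISIBLE_DEVICES=all
              env:
                # Override the baked-in "all" so the head pod sees no GPUs
                # through the envvar mechanism.
                - name: NVIDIA_VISIBLE_DEVICES
                  value: "void"
              resources:
                limits:
                  cpu: "2"
                  memory: 4Gi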

cadedaniel commented 2 years ago

Thanks for reporting this issue @dcarrion87! Glad you've resolved it. Would it help if we made this more clear in the docs? cc @jjyao @DmitriGekhtman

dcarrion87 commented 2 years ago

I think so...

Looks like this is an issue in multiple projects. Noticed it with Kubeflow. What an absolute pain. Volume mounts may be the answer? https://docs.google.com/document/d/1uXVF-NWZQXgP1MLb87_kMkQvidpnkNWicdpO2l9g-fw/edit#

DmitriGekhtman commented 2 years ago

> Regardless of nvidia.com/gpu limits, the head and worker nodes are coming up with all 8 GPU devices every time. E.g. I limit a worker node to 2 and it comes up with 8; I specify none for the head node and it comes up with 8.

Could you explain what you mean by "comes up with 8 GPUs"? Which system registers the presence of the GPUs? What's the precise undesirable behavior?

dcarrion87 commented 2 years ago

@DmitriGekhtman in a Kubernetes environment where we want GPU limits to be controlled by the NVIDIA GPU operator, having NVIDIA_VISIBLE_DEVICES=all as an env var in the images means that every GPU on the host comes up in the container instead of the controlled amount specified via the "nvidia.com/gpu" limit. This document describes the undesirable effect and the changes to the spec that minimise it by using volume mounts: https://docs.google.com/document/d/1uXVF-NWZQXgP1MLb87_kMkQvidpnkNWicdpO2l9g-fw/edit#
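
To make the mismatch concrete, here is a hedged sketch of the kind of worker pod fragment involved (values are illustrative, not taken from this cluster). The nvidia.com/gpu limit asks the device plugin for two devices, but with the envvar device-list strategy the runtime reads NVIDIA_VISIBLE_DEVICES=all from the image and exposes all eight host GPUs anyway:

  # Fragment of a hypothetical worker pod template.
  spec:
    containers:
      - name: ray-worker
        image: rayproject/ray:2.0.0-gpu     # placeholder tag; GPU images ship NVIDIA_VISIBLE_DEVICES=all
        resources:
          limits:
            nvidia.com/gpu: "2"             # device plugin allocates 2 devices...
        # ...but the container runtime honours the env var baked into the
        # image and mounts every GPU on the host, bypassing that allocation.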

DmitriGekhtman commented 2 years ago

Thanks, that makes sense.

DmitriGekhtman commented 2 years ago

cc @kevin85421

DmitriGekhtman commented 2 years ago

This is most likely an issue that should be solved in the KubeRay repo, but it's fine to track it here. I'd consider this a blocker for the KubeRay 0.4.0 release.

dcarrion87 commented 2 years ago

@DmitriGekhtman @kevin85421

We were able to solve this, in case you want to document the approach anywhere. It relies heavily on a locked-down Kubernetes workspace environment that we provide to users, where the only thing they can change is the image.

  1. Update /etc/nvidia-container-runtime/config.toml to support both envvar and volume-mount based assignment:

     accept-nvidia-visible-devices-envvar-when-unprivileged = true
     accept-nvidia-visible-devices-as-volume-mounts = true

     You could set the envvar option to false, but we rely on the toolkit checking the volume mounts first and bailing out of the envvar path if it finds any. We also still need unprivileged envvar capabilities for another use case, where we share the same GPU between two containers in the same pod and need to inject the GPU ID from the GPU-hogging container into the other unprivileged container.

  2. Update the NVIDIA GPU operator to use volume-based device listing:

     apiVersion: helm.cattle.io/v1
     kind: HelmChart
     metadata:
       name: gpu-operator
       namespace: kube-system
     spec:
       # Set explicitly - NVIDIA breaks CRDs between versions
       chart: https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-v1.11.1.tgz
       targetNamespace: gpu-operator
       set:
         driver.enabled: "false"
         toolkit.enabled: "false"
         devicePlugin.env[0].name: "DEVICE_LIST_STRATEGY"
         devicePlugin.env[0].value: "volume-mounts"
  3. Change the default KubeRay cluster chart values to override the head and worker variables with NVIDIA_VISIBLE_DEVICES=void, so a container can never get all GPUs via the envvar method (see the sketch after this list).

  4. Rely on the volume-mount strategy to assign the correct GPUs into the unprivileged containers via the nvidia.com/gpu limit.
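
For steps 3 and 4, a minimal sketch of what the chart-value override could look like. The key names (head.containerEnv, worker.containerEnv, worker.resources) and the GPU count are assumptions about the ray-cluster chart layout, not values taken from this thread, so adapt them to the chart version in use:

  # Hypothetical ray-cluster Helm chart values override.
  head:
    containerEnv:
      # Never hand GPUs to the head via the envvar path.
      - name: NVIDIA_VISIBLE_DEVICES
        value: "void"
  worker:
    containerEnv:
      # Workers also block the envvar path; they receive GPUs only through
      # the device plugin's volume-mount strategy configured in step 2.
      - name: NVIDIA_VISIBLE_DEVICES
        value: "void"
    resources:
      limits:
        nvidia.com/gpu: "2"   # placeholder count; the plugin mounts exactly these devices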

It looks like NVIDIA is working on something for upcoming Kubernetes versions that should stop this mayhem in shared environments: https://gitlab.com/nvidia/cloud-native/k8s-dra-driver

kevin85421 commented 2 years ago

Thanks @dcarrion87 for this information! I will take a look at this today.

DmitriGekhtman commented 2 years ago

This appears to be an NVIDIA problem. Decreasing priority for now.