ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] Ray Head access to extra GPU resources #2098

Open shaowei-su opened 2 months ago

shaowei-su commented 2 months ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

If the Ray head node is scheduled on a GPU node without requesting any GPU resources, e.g.

      resources:
        limits:
          ephemeral-storage: 10Gi
          memory: 16Gi
        requests:
          cpu: '4'
          ephemeral-storage: 10Gi
          memory: 16Gi

The Ray resource scheduler can still access those GPUs and counts all of the host's GPUs as "Logical Resources" during scheduling.

[Screenshots: 2024-04-23 at 16:39:18 and 2024-04-23 at 16:39:11]

Reproduction script

Use the RayJob CRD to schedule both the head and the workers on the same physical host with more than one GPU.
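
A minimal sketch of what such a RayJob could look like, not the exact manifest from the report. The job name, image tag, node name, worker group name, and entrypoint are all placeholders chosen for illustration; only the head container's `resources` block is taken from the description above. The entrypoint prints what Ray believes the cluster has, so extra GPUs reported for the head would show up in the job logs.

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: head-gpu-repro            # hypothetical name
spec:
  # Prints the cluster's logical resources as seen by Ray.
  entrypoint: 'python -c "import ray; ray.init(); print(ray.cluster_resources())"'
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          nodeSelector:
            kubernetes.io/hostname: my-gpu-node   # placeholder: a node with > 1 GPU
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0-gpu     # assumed image tag
              resources:                          # no nvidia.com/gpu requested
                limits:
                  ephemeral-storage: 10Gi
                  memory: 16Gi
                requests:
                  cpu: '4'
                  ephemeral-storage: 10Gi
                  memory: 16Gi
    workerGroupSpecs:
      - groupName: gpu-workers                    # hypothetical group name
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        rayStartParams: {}
        template:
          spec:
            nodeSelector:
              kubernetes.io/hostname: my-gpu-node # same host as the head
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0-gpu   # assumed image tag
                resources:
                  limits:
                    nvidia.com/gpu: 1
```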

Anything else

No response

Are you willing to submit a PR?

kevin85421 commented 2 months ago

This is not a KubeRay-specific issue. See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/gpu.html#gpu-multi-tenancy for more details. Recently, GPU UX on K8s seems to have improved. I will take a look at MIG and time-slicing GPU and get back to you.
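
For anyone landing here, a sketch of one possible workaround on the head group, not something confirmed in this thread. It assumes the GPUs leak in through the NVIDIA container runtime (for example, an image that sets `NVIDIA_VISIBLE_DEVICES=all`) and that it is acceptable for the head to advertise zero GPUs. The image tag is a placeholder.

```yaml
headGroupSpec:
  rayStartParams:
    num-gpus: "0"                      # passed to `ray start --num-gpus`, so the head advertises no GPUs
  template:
    spec:
      containers:
        - name: ray-head
          image: rayproject/ray:2.9.0-gpu   # assumed image tag
          env:
            # Assumption: with the NVIDIA container runtime, "void" exposes no GPUs to the container.
            - name: NVIDIA_VISIBLE_DEVICES
              value: "void"
```

`num-gpus: "0"` only changes what Ray schedules against; the environment variable is what would keep the devices out of the container, and whether it takes effect depends on how the cluster's device plugin and runtime are configured.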