ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] When using nvidia/GPU, worker pods cannot be created #2528

Closed. BeerTai closed this issue 4 hours ago

BeerTai commented 3 days ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

When I request nvidia.com/gpu in the YAML file, the worker pods cannot be created, but with plain CPU resources the pods are created correctly.

Reproduction script

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
    # A unique identifier for the head node and workers of this cluster.
  name: raycluster-complete
spec:
  rayVersion: '2.38.0'
  # Ray head pod template
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: '0.0.0.0'
    # pod template
    template:
      metadata:
        labels: {}
      spec:
        containers:
        - name: ray-head
          image: harbor.unijn.cn/ray
          resources:
            limits:
              cpu: 10
              memory: 20Gi
            requests:
              cpu: 10
              memory: 20Gi
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          volumeMounts:
            - mountPath: /fs/nlp/sunjian
              name: ray-logs
        volumes:
          - name: ray-logs
            emptyDir: {}
  workerGroupSpecs:
  # the pod replicas in this group typed worker
  - replicas: 2
    minReplicas: 1
    maxReplicas: 4
    # logical group name; for this example it is called large-group
    groupName: large-group
    rayStartParams: {}
    #pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: harbor.unijn.cn/ray
          # Optimal resource allocation will depend on your Kubernetes infrastructure and might
          # require some experimentation.
          # Setting requests=limits is recommended with Ray. K8s limits are used for Ray-internal
          # resource accounting. K8s requests are not used by Ray.
          resources:
            limits:
              nvidia.com/gpu: 8 
              memory: 500Gi 
            requests:
              nvidia.com/gpu: 8 
              memory: 500Gi 
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          # Optional: mount volumes into the worker container.
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumeMounts:
            - mountPath:  /fs/nlp/sunjian
              name: ray-logs
        # use volumes
        # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
        volumes:
          - name: ray-logs
            emptyDir: {}

Anything else

Only the head pod is created.
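
One thing worth checking (a sketch, assuming kubectl access to the same cluster): whether the nodes actually advertise nvidia.com/gpu. If the NVIDIA device plugin is not installed, a pod that requests nvidia.com/gpu cannot be scheduled.

# If this prints nothing, no node exposes nvidia.com/gpu and the worker's GPU request cannot be satisfied.
kubectl describe nodes | grep -i "nvidia.com/gpu"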

Are you willing to submit a PR?

kevin85421 commented 2 days ago

Which KubeRay version are you using? If you’re using KubeRay v1.2.2, you can run kubectl describe raycluster $YOUR_RAYCLUSTER and check the Kubernetes events to see why the pods failed to create.
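
For the cluster in this issue (metadata.name: raycluster-complete), that would be roughly:

# Status and any events recorded against the RayCluster custom resource.
kubectl describe raycluster raycluster-complete

# Recent events in the namespace; failed pod creation or scheduling shows up here.
kubectl get events --sort-by=.lastTimestamp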

BeerTai commented 2 days ago

> Which KubeRay version are you using? If you’re using KubeRay v1.2.2, you can run kubectl describe raycluster $YOUR_RAYCLUSTER and check the Kubernetes events to see why the pods failed to create.

I ran kubectl describe raycluster and got:

Status:
  Desired CPU:              10
  Desired GPU:              8
  Desired Memory:           520Gi
  Desired TPU:              0
  Desired Worker Replicas:  1
  Endpoints:
    Client:     10001
    Dashboard:  8265
    Gcs:        6379
    Metrics:    8080
  Head:
    Pod IP:             10.233.105.84
    Pod Name:           raycluster-complete-head-x8x26
    Service IP:         10.233.105.84
    Service Name:       raycluster-complete-head-svc
  Last Update Time:     2024-11-12T01:17:32Z
  Max Worker Replicas:  4
  Min Worker Replicas:  1
  Observed Generation:  2
  State:                ready
  State Transition Times:
    Ready:  2024-11-11T08:45:40Z
Events:     <none>
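
Since no events show up on the RayCluster itself, the worker pods and the operator logs are usually the next places to look (a sketch; kuberay-operator is the default deployment name from the Helm chart and may differ here):

# Were the worker pods created at all, and are they Pending?
kubectl get pods -o wide

# If a worker pod exists but is Pending, the scheduler's reason appears in its events.
kubectl describe pod <worker-pod-name>

# If no worker pod was created, the KubeRay operator logs the creation error.
kubectl logs deployment/kuberay-operator -n <operator-namespace>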