ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] When using nvidia/GPU, worker pods cannot be created #2528

Closed. BeerTai closed this issue 4 hours ago

BeerTai commented 3 days ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

When I request nvidia.com/gpu in the YAML file, the worker pods cannot be created, but with plain CPU resources the pods are created correctly.

Reproduction script

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
    # A unique identifier for the head node and workers of this cluster.
  name: raycluster-complete
spec:
  rayVersion: '2.38.0'
  # Ray head pod template
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: '0.0.0.0'
    # pod template
    template:
      metadata:
        labels: {}
      spec:
        containers:
        - name: ray-head
          image: harbor.unijn.cn/ray
          resources:
            limits:
              cpu: 10
              memory: 20Gi
            requests:
              cpu: 10
              memory: 20Gi
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          volumeMounts:
            - mountPath: /fs/nlp/sunjian
              name: ray-logs
        volumes:
          - name: ray-logs
            emptyDir: {}
  workerGroupSpecs:
  # the pod replicas in this group typed worker
  - replicas: 2
    minReplicas: 1
    maxReplicas: 4
    # logical group name; for this example it is called large-group
    groupName: large-group
    rayStartParams: {}
    #pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: harbor.unijn.cn/ray
          # Optimal resource allocation will depend on your Kubernetes infrastructure and might
          # require some experimentation.
          # Setting requests=limits is recommended with Ray. K8s limits are used for Ray-internal
          # resource accounting. K8s requests are not used by Ray.
          resources:
            limits:
              nvidia.com/gpu: 8 
              memory: 500Gi 
            requests:
              nvidia.com/gpu: 8 
              memory: 500Gi 
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          # Optional: mount volumes into the worker container.
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumeMounts:
            - mountPath:  /fs/nlp/sunjian
              name: ray-logs
        # use volumes
        # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
        volumes:
          - name: ray-logs
            emptyDir: {}

Anything else

Only the head pod is created.
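
One thing worth checking (a sketch, assuming kubectl access to the same cluster): whether the nodes actually advertise nvidia.com/gpu. If the NVIDIA device plugin is not installed, a pod that requests nvidia.com/gpu cannot be scheduled.

# If this prints nothing, no node exposes nvidia.com/gpu and the worker's GPU request cannot be satisfied.
kubectl describe nodes | grep -i "nvidia.com/gpu"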

Are you willing to submit a PR?

kevin85421 commented 2 days ago

Which KubeRay version are you using? If you’re using KubeRay v1.2.2, you can run kubectl describe raycluster $YOUR_RAYCLUSTER and check the Kubernetes events to see why the pods failed to create.
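
For the cluster in this issue (metadata.name: raycluster-complete), that would be roughly:

# Status and any events recorded against the RayCluster custom resource.
kubectl describe raycluster raycluster-complete

# Recent events in the namespace; failed pod creation or scheduling shows up here.
kubectl get events --sort-by=.lastTimestamp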

BeerTai commented 2 days ago

> Which KubeRay version are you using? If you’re using KubeRay v1.2.2, you can run kubectl describe raycluster $YOUR_RAYCLUSTER and check the Kubernetes events to see why the pods failed to create.

I ran kubectl describe raycluster and got:

Status:
  Desired CPU:              10
  Desired GPU:              8
  Desired Memory:           520Gi
  Desired TPU:              0
  Desired Worker Replicas:  1
  Endpoints:
    Client:     10001
    Dashboard:  8265
    Gcs:        6379
    Metrics:    8080
  Head:
    Pod IP:             10.233.105.84
    Pod Name:           raycluster-complete-head-x8x26
    Service IP:         10.233.105.84
    Service Name:       raycluster-complete-head-svc
  Last Update Time:     2024-11-12T01:17:32Z
  Max Worker Replicas:  4
  Min Worker Replicas:  1
  Observed Generation:  2
  State:                ready
  State Transition Times:
    Ready:  2024-11-11T08:45:40Z
Events:     <none>
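
Since no events show up on the RayCluster itself, the worker pods and the operator logs are usually the next places to look (a sketch; kuberay-operator is the default deployment name from the Helm chart and may differ here):

# Were the worker pods created at all, and are they Pending?
kubectl get pods -o wide

# If a worker pod exists but is Pending, the scheduler's reason appears in its events.
kubectl describe pod <worker-pod-name>

# If no worker pod was created, the KubeRay operator logs the creation error.
kubectl logs deployment/kuberay-operator -n <operator-namespace>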