ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] KubeRay cluster resource status is reporting Ready when there are pods still pending #2188

Open tsailiming opened 2 weeks ago

tsailiming commented 2 weeks ago

Search before asking

KubeRay Component

apiserver

What happened + What you expected to happen

When there are pods stuck in Pending because of insufficient resources, the RayCluster state is reported as ready.

status:
  desiredCPU: "22"
  desiredGPU: "4"
  desiredMemory: 24G
  desiredTPU: "0"
  desiredWorkerReplicas: 2
  endpoints:
    client: "10001"
    dashboard: "8265"
    gcs: "6379"
    metrics: "8080"
  head:
    serviceIP: 172.30.12.150
  lastUpdateTime: "2024-06-12T13:35:00Z"
  maxWorkerReplicas: 2
  minWorkerReplicas: 2
  observedGeneration: 2
  state: ready

This is the status of the head pod:

status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2024-06-12T13:55:11Z'
      reason: Unschedulable
      message: '0/5 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 3 node(s) didn''t match Pod''s node affinity/selector. preemption: 0/5 nodes are available: 1 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling..'
  qosClass: Burstable
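
The mismatch between the two statuses above can be checked mechanically. The following is only a minimal sketch, assuming the RayCluster status and the pod objects have already been loaded into plain dictionaries (for example from `kubectl get ... -o json` output). The function names and the pod name are hypothetical and not part of KubeRay.

```python
from typing import Any


def unschedulable_pods(pods: list[dict[str, Any]]) -> list[str]:
    """Return the names of pods that are Pending with PodScheduled=False."""
    names = []
    for pod in pods:
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        for cond in status.get("conditions", []):
            if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
                names.append(pod.get("metadata", {}).get("name", "<unnamed>"))
                break
    return names


def ready_state_is_consistent(cluster_status: dict[str, Any],
                              pods: list[dict[str, Any]]) -> bool:
    """Treat a 'ready' cluster as inconsistent if any of its pods is unschedulable."""
    if cluster_status.get("state") != "ready":
        return True
    return not unschedulable_pods(pods)


# Example data mirroring the statuses in this report.
# The pod name below is a hypothetical placeholder.
cluster_status = {"state": "ready", "desiredWorkerReplicas": 2}
pods = [
    {
        "metadata": {"name": "raycluster-head-example"},
        "status": {
            "phase": "Pending",
            "conditions": [
                {"type": "PodScheduled", "status": "False", "reason": "Unschedulable"}
            ],
        },
    }
]

print(unschedulable_pods(pods))
print(ready_state_is_consistent(cluster_status, pods))
```

With the data from this report, the check flags the cluster as inconsistent: it reports `state: ready` while the head pod is still unschedulable.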

Reproduction script

  1. Submit a RayCluster that meets the ClusterQueue quota requirement, so that it is admitted and runs rather than staying in the Suspended state.
  2. Make sure the worker node(s) have insufficient resources to run the pods.

Anything else

No response

tsailiming commented 2 weeks ago

@astefanutti Filed this as per your request.

andrewsykim commented 2 weeks ago

@tsailiming what's the KubeRay version? In previous versions it is a known issue that the RayCluster status stays ready indefinitely once it observes all worker pods as running. There's some discussion about it in https://github.com/ray-project/kuberay/pull/1930
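
To illustrate the behavior described in this comment, here is a toy model only. It is not KubeRay's actual code, and the function names and the `not-ready` label are assumptions made up for the example. It contrasts a state that latches to `ready` with one that is recomputed from the current pod phases on every pass.

```python
def all_running(worker_phases: list[str]) -> bool:
    """True when there is at least one worker and every worker is Running."""
    return bool(worker_phases) and all(p == "Running" for p in worker_phases)


def sticky_state(previous_state: str, worker_phases: list[str]) -> str:
    """Toy model of the reported issue: once 'ready', the state never changes."""
    if previous_state == "ready":
        return "ready"
    if all_running(worker_phases):
        return "ready"
    return previous_state


def recomputed_state(worker_phases: list[str]) -> str:
    """Alternative: derive the state from the current pod phases every time."""
    return "ready" if all_running(worker_phases) else "not-ready"


# First observation: all workers are running, so both models say "ready".
state = sticky_state("", ["Running", "Running"])

# Later observation: one worker has gone back to Pending.
later_phases = ["Pending", "Running"]
print(sticky_state(state, later_phases))
print(recomputed_state(later_phases))
```

In this sketch the latched state stays `ready` after a worker becomes Pending, while the recomputed state does not.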

tsailiming commented 2 weeks ago

From one of the head pods. This is from OpenShift AI 2.9.1.

$ ray --version
ray, version 2.7.1

andrewsykim commented 2 weeks ago

@tsailiming I meant the KubeRay version, not the Ray version.