ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] [raycluster-controller] KubeRay cannot recreate a new RayCluster head pod after it has been evicted by the kubelet due to disk pressure #2125

Open xjhust opened 1 month ago

xjhust commented 1 month ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

I have deployed kuberay-operator and a ray-cluster in my k8s cluster, and it works fine most of the time. When I use the fallocate command to force the k8s node where the RayCluster head pod runs into a disk-pressure condition, the head pod is evicted by the kubelet. No new head pod is created on another healthy k8s node, even after a long time. And when I delete the large file created by fallocate and relieve the node's disk pressure, the head pod is still in the evicted state and no new head pod is created. So I have to manually delete the evicted head pod to make the cluster work again; obviously this makes our production environment unstable and the service is not highly available. My expected behavior is that the RayCluster controller works like a Deployment: when the head pod is evicted, it automatically recreates a new head pod.

Reproduction script

Basic version information

Kubernetes: v1.20.15 ray-operator: v1.0.0 raycluster(ray): 2.9.0

Reproduction steps

  1. Deploy kuberay-operator and a RayCluster in the k8s cluster.
  2. Find the k8s node where the RayCluster head pod runs and use a fallocate command such as fallocate -l 1000G tmpfile to fill the imagefs; the head pod will then be evicted by the kubelet due to disk pressure.
  3. Wait for a while; no new head pod is created on another k8s node.
  4. Release the imagefs by deleting the tmpfile and wait for a while; the head pod is still in the evicted state and the RayCluster is still unavailable.

Anything else

This happens every time the RayCluster head pod is evicted. The relevant logs from the ray-operator pod look like this:

2024-05-08T07:55:30.304Z INFO controllers.RayCluster reconcilePods {"Found 1 head Pod": "ray-cluster-head-dqtxb", "Pod status": "Failed", "Pod restart policy": "Always", "Ray container terminated status": "nil"}
2024-05-08T07:55:30.304Z INFO controllers.RayCluster reconcilePods {"head Pod": "ray-cluster-head-dqtxb", "shouldDelete": false, "reason": "The status of the head Pod ray-cluster-head-dqtxb is Failed. However, KubeRay will not delete the Pod because its restartPolicy is set to 'Always' and it should be able to restart automatically."}

So I read the relevant code in the RayCluster controller and found that it relies on the Pod's restart policy to restart the Pod when its status is Failed. In this case, however, a pod that has been evicted by the kubelet will not restart, so the RayCluster does not recover after the eviction.
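
For context, here is a minimal Go sketch of the kind of gate described above (and reflected in the log message). The function name shouldDeletePod and the nodeType parameter follow the discussion in this thread, but the body is a simplified illustration, not the actual KubeRay source:

package controllers

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// shouldDeletePod sketches the restart-policy gate discussed in this issue:
// a Pod in a terminal phase is only deleted (and later recreated) when its
// restartPolicy is not Always. Simplified illustration only.
func shouldDeletePod(pod corev1.Pod, nodeType string) (bool, string) {
	terminal := pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodSucceeded
	if !terminal {
		return false, fmt.Sprintf("The %s Pod %s is in phase %s; nothing to do.", nodeType, pod.Name, pod.Status.Phase)
	}
	if pod.Spec.RestartPolicy == corev1.RestartPolicyAlways {
		// This is the branch that leaves an evicted (Failed) head Pod in place.
		return false, fmt.Sprintf(
			"The status of the %s Pod %s is %s. However, KubeRay will not delete the Pod because its "+
				"restartPolicy is set to 'Always' and it should be able to restart automatically.",
			nodeType, pod.Name, pod.Status.Phase)
	}
	return true, fmt.Sprintf("The %s Pod %s is in terminal phase %s; delete it so a replacement can be created.", nodeType, pod.Name, pod.Status.Phase)
}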

Are you willing to submit a PR?

DmitriGekhtman commented 1 month ago

Well, at least the log message clearly expresses the authors' misunderstanding of Kubernetes eviction behavior (which to be fair is extremely confusing).

JasonChen86899 commented 5 days ago

@xjhust

a pod that has been evicted by the kubelet will not restart

Pods evicted by a K8s node will be re-scheduled. If not, you need to check whether your pod has some kind of affinity, or something else set, that prevents it from being re-scheduled.

DmitriGekhtman commented 5 days ago

@xjhust

a pod that has been evicted by the kubelet will not restart

Pods evicted by a K8s node will be re-scheduled.

That is not true. A pod evicted due to disk pressure is marked failed and will never be rescheduled. If the pod is part of a replicaset, then a new pod will be created to replace it. This issue is asking for the KubeRay operator to do the same thing as the replicaset controller.

andrewsykim commented 5 days ago

In the case where a Node is down for long enough and Kubernetes evicts the pod (i.e. deletes it), there should be some logic in the KubeRay operator to recreate the head pod. As Dmitri already said, in the case of disk pressure, the Pod object is never deleted, so the recreate logic never kicks in.

However, per the Stack Overflow discussion shared earlier, pods evicted due to disk pressure usually have a Failed status. And there is actually logic in the RayCluster controller to delete / recreate Pods in Failed status, see shouldDeletePod. But KubeRay's recreate logic does not apply if the head pod has a restartPolicy of Always; see the comments in shouldDeletePod for the reason.

@xjhust it seems like if you set the restart policy of the head pod to OnFailure or Never, then you might be able to trigger the head recreate logic on disk pressure eviction.

andrewsykim commented 5 days ago

This issue is asking for the KubeRay operator to do the same thing as the replicaset controller.

I suspect that if we ran the head pod as a single-replica StatefulSet, we may be able to automatically recover from some of these scenarios. @kevin85421 have we ever considered this before?

DmitriGekhtman commented 5 days ago

This issue is asking for the KubeRay operator to do the same thing as the replicaset controller.

I suspect that if we ran the head pod as a single-replica StatefulSet, we may be able to automatically recover from some of these scenarios. @kevin85421 have we ever considered this before?

This might not be the right move, since some users will expect the head pod to exit on failure. (You wouldn't be able to support restart policy Never.)

My suggestion would be to modify the operator code to create a new pod if the restart policy is Always and the pod enters Failed state.

andrewsykim commented 5 days ago

My suggestion would be to modify the operator code to create a new pod if the restart policy is Always and the pod enters Failed state.

The comments here imply that pods with restartPolicy Always never reach the Failed state, but maybe we never tested this with disk pressure eviction. Is someone able to verify this before we change this behavior?

DmitriGekhtman commented 5 days ago

My suggestion would be to modify the operator code to create a new pod if the restart policy is Always and the pod enters Failed state.

The comments here imply that pods with restartPolicy Always never reach the Failed state, but maybe we never tested this with disk pressure eviction. Is someone able to verify this before we change this behavior?

I'm pretty sure the comment is incorrect. I agree that someone should confirm -- try inducing a disk pressure eviction on a pod with restart policy Always, as described in the issue description.

andrewsykim commented 5 days ago

@xjhust can you help confirm this?

kevin85421 commented 5 days ago

since some users will expect the head pod to exit on failure.

@DmitriGekhtman: currently, if the head Pod's restartPolicy is Never and the Pod becomes Failed, KubeRay will delete the head Pod and create a new one. Is the recreate behavior OK for "exit on failure" you mentioned above?

DmitriGekhtman commented 5 days ago

@xjhust can you help confirm this?

Note the issue description includes a suggestion for reproduction (run the fallocate command). If you have easy access to a Kubernetes cluster, it should be straightforward to reproduce.

I've observed this particular behavior (pod with restart policy always failing due to node pressure conditions) in production systems -- usually due to a large core dump after a segfault.

andrewsykim commented 5 days ago

I would feel okay with removing the isRestartPolicyAlways check. It seems like in most normal conditions the Pod phase never flips to Failed, so the deletion would never happen, but in the uncommon cases like disk pressure it would trigger the delete & recreate, which can't be worse than the current state.

kevin85421 commented 5 days ago

Try to summarize the current discussion:

  • Currently, if a Pod's restartPolicy is Always, KubeRay will not delete the Pod. Instead, KubeRay waits for the Pod to restart by itself.
  • However, if a Pod is evicted due to disk pressure, the Pod's status transitions to Failed and the Pod will not be rescheduled to another K8s node, even if its restartPolicy is Always.
  • Our solution: if the Pod's status is Failed or Succeeded, KubeRay should delete the Pod and create a new one regardless of the value of restartPolicy.

Is my understanding correct? cc @DmitriGekhtman @andrewsykim
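
A rough sketch of what the proposed rule could look like, assuming the same simplified shouldDeletePod shape as the earlier sketch; again, illustrative only, not the actual KubeRay code:

package controllers

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// shouldDeletePodProposed sketches the proposed behavior: Failed and Succeeded
// are terminal phases, and the kubelet never restarts containers of a Pod in a
// terminal phase, so delete it regardless of restartPolicy and let the
// controller create a replacement.
func shouldDeletePodProposed(pod corev1.Pod, nodeType string) (bool, string) {
	if pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodSucceeded {
		return true, fmt.Sprintf(
			"The %s Pod %s is in terminal phase %s; delete it so a replacement can be created.",
			nodeType, pod.Name, pod.Status.Phase)
	}
	return false, fmt.Sprintf("The %s Pod %s is in phase %s; leave it alone.", nodeType, pod.Name, pod.Status.Phase)
}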

andrewsykim commented 5 days ago

Yes, that's my understanding too, but I personally have not tested the pod status behavior during disk pressure eviction; it'd be good to verify this behavior. Other than that, it seems like an improvement to handle Failed status even if restartPolicy is Always.

DmitriGekhtman commented 5 days ago

since some users will expect the head pod to exit on failure.

@DmitriGekhtman: currently, if the head Pod's restartPolicy is Never and the Pod becomes Failed, KubeRay will delete the head Pod and create a new one. Is the recreate behavior OK for "exit on failure" you mentioned above?

Thanks for clarifying this -- I misunderstood the current behavior of the Never restart policy. It sounds good to keep the current behavior in the Never restart policy case.

Our solution: if the Pod's status is Failed or Succeeded, KubeRay should delete the Pod and create a new one regardless of the value of restartPolicy.

Is my understanding correct? cc @DmitriGekhtman @andrewsykim

That sounds like a good idea.

Context on why pods are marked failed in node pressure situations: the kubelet needs to protect the node from the faulty pod's resource usage.

The Kubernetes docs do not directly address the case of restart policy always. But the behavior can be inferred from this section: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#self-healing-behavior

"If the pods are managed by a workload management object (such as StatefulSet or Deployment) that replaces failed pods, the control plane (kube-controller-manager) creates new pods in place of the evicted pods."

You can infer that if a controller does not replace the failed pod, there will be no replacement.

DmitriGekhtman commented 5 days ago

If it's possible to use a 1-pod replicaset with a deterministic name for the Ray head, it would also resolve https://github.com/ray-project/kuberay/issues/715, for the Ray head.

andrewsykim commented 5 days ago

1-pod replicaset

What's the reason for a 1 pod replicaset instead of a statefulset?

DmitriGekhtman commented 5 days ago

1-pod replicaset

What's the reason for a 1 pod replicaset instead of a statefulset?

Likely either would work -- we could discuss the relative merits. Main reason I'd lean towards a replicaset is that it is simpler. All we need is the ability to reliably keep exactly 1 pod running, and a replicaset is the simplest primitive that allows this. Basically, I think the capabilities and limitations of statefulsets are extraneous to KubeRay requirements; but I could be wrong!
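
To make the idea concrete, here is a hypothetical Go sketch of a single-replica ReplicaSet with a deterministic name for the head; the function, labels, and names below are illustrative assumptions, not proposed KubeRay code:

package controllers

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// headReplicaSet sketches the "1-pod ReplicaSet for the head" idea: the
// ReplicaSet controller replaces a Failed (e.g. evicted) head Pod on its own,
// and the deterministic name would also help with issue #715. headPodTemplate
// is assumed to come from the RayCluster's head group spec.
func headReplicaSet(clusterName, namespace string, headPodTemplate corev1.PodTemplateSpec) *appsv1.ReplicaSet {
	replicas := int32(1)
	labels := map[string]string{
		"ray.io/cluster":   clusterName,
		"ray.io/node-type": "head",
	}
	// The selector must match the Pod template labels.
	if headPodTemplate.Labels == nil {
		headPodTemplate.Labels = map[string]string{}
	}
	for k, v := range labels {
		headPodTemplate.Labels[k] = v
	}
	return &appsv1.ReplicaSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      clusterName + "-head", // deterministic name
			Namespace: namespace,
			Labels:    labels,
		},
		Spec: appsv1.ReplicaSetSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: headPodTemplate,
		},
	}
}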

JasonChen86899 commented 4 days ago

Likely either would work -- we could discuss the relative merits. Main reason I'd lean towards a replicaset is that it is simpler. All we need is the ability to reliably keep exactly 1 pod running, and a replicaset is the simplest primitive that allows this. Basically, I think the capabilities and limitations of statefulsets are extraneous to KubeRay requirements; but I could be wrong!

I think a 1-pod ReplicaSet or Deployment for the head and a StatefulSet for the workers would be OK. But if we use them to reconcile the head node or worker nodes, the reconcilePods function would need to change a lot and would need testing.

JasonChen86899 commented 4 days ago

Try to summarize the current discussion:

  • Currently, if a Pod's restartPolicy is Always, KubeRay will not delete the Pod. Instead, KubeRay waits for the Pod to restart by itself.
  • However, if a Pod is evicted due to disk pressure, the Pod's status transitions to Failed and the Pod will not be rescheduled to another K8s node, even if its restartPolicy is Always.
  • Our solution: if the Pod's status is Failed or Succeeded, KubeRay should delete the Pod and create a new one regardless of the value of restartPolicy.

Is my understanding correct? cc @DmitriGekhtman @andrewsykim

If we ignore the container restart policy, I think there could be a scenario where the container is already restarting, or has even recovered, but we delete the Pod based on its previous Failed state; this could cause a temporary disruption of the Ray head node.

Is it possible to make a simple judgment for the eviction scenario?

// Proposed tweak: treat kubelet-evicted Pods as deletable even when
// restartPolicy is Always, since an evicted Pod never restarts in place.
isPodEvicted := pod.Status.Reason == "Evicted"
if isRestartPolicyAlways && !isPodEvicted {
    reason := fmt.Sprintf(
        "The status of the %s Pod %s is %s. However, KubeRay will not delete the Pod because its restartPolicy is set to 'Always' "+
            "and it should be able to restart automatically.", nodeType, pod.Name, pod.Status.Phase)
    return false, reason
}
DmitriGekhtman commented 4 days ago

a StatefulSet for the workers would be OK

StatefulSet for workers wouldn't work for the autoscaling use-case. When scaling down, we need to specify the specific worker that's being removed. So, we need to manage individual pods for workers.

DmitriGekhtman commented 4 days ago

Is it possible to make a simple judgment about the eviction scenario?

Yeah, it'd be best for now to make the simplest fix to the code that's causing the issue :)

kevin85421 commented 4 days ago

If we ignore the container restart policy, I think there could be a scenario where the container is already restarting, or has even recovered

If a Pod is Succeeded or Failed, the Pod will not restart based on this doc. In addition, the head Pod of a RayCluster should always be running as long as the CR's suspend is not true.

JasonChen86899 commented 3 days ago

If a Pod is Succeeded or Failed, the Pod will not restart based on this doc.

Pods that have an auto-restart policy set may undergo a transition between the Failed and Running states, but usually return to the Running state quickly

kevin85421 commented 3 days ago

Got it. It makes sense. Thanks!

MortalHappiness commented 13 hours ago

@kevin85421 Can I work on this issue? After some trial and error, I am finally able to reproduce the eviction behavior on my own computer. (I let the head pod be evicted due to low memory on the node.) Here are the steps:

# Create k3d cluster with 2 agent nodes, each with 3GB memory and hard eviction limit 1GiB
k3d cluster create \
  --agents 2 \
  --k3s-arg "--disable=traefik@server:0" \
  --agents-memory 3g \
  --k3s-arg "--kubelet-arg=eviction-hard=memory.available<1Gi@agent:0" \
  --k3s-arg "--kubelet-arg=eviction-hard=memory.available<1Gi@agent:1"

kubectl get nodes -o custom-columns=NAME:.metadata.name,CAPACITY_MEMORY:.status.capacity.memory,ALLOCATABLE_MEMORY:.status.allocatable.memory 
# Output:
# 
# NAME                       CAPACITY_MEMORY   ALLOCATABLE_MEMORY
# k3d-k3s-default-agent-1    3221225Ki         2172649Ki
# k3d-k3s-default-agent-0    3221225Ki         2172649Ki
# k3d-k3s-default-server-0   32590664Ki        32590664Ki

# Taint agent 0 and 1 so that pods will only be deployed to server-0
kubectl taint nodes k3d-k3s-default-agent-0 k3d=noschedule:NoSchedule
kubectl taint nodes k3d-k3s-default-agent-1 k3d=noschedule:NoSchedule

# Install Kuberay operator
helm install kuberay-operator kuberay/kuberay-operator --namespace ray-system --version 1.1.1 --create-namespace

# Taint server 0 and untaint agent 0 and 1 so that pods will only be deployed to agent 0 or 1
kubectl taint nodes k3d-k3s-default-server-0 k3d=noschedule:NoSchedule
kubectl taint nodes k3d-k3s-default-agent-0 k3d=noschedule:NoSchedule-
kubectl taint nodes k3d-k3s-default-agent-1 k3d=noschedule:NoSchedule-

# Install RayCluster
# Note that the head and worker pods will be deployed to different nodes
# because the memory resource request for head pod is 2G
# and the memory resource request for worker pod is 1G
# and agent 0 and 1 each only has 2G allocatable memory
helm install raycluster kuberay/ray-cluster --version 1.1.1

# Copy statically linked "stress-ng" binary into head pod
kubectl cp ./stress-ng <head-pod>:/home/ray

# Open a shell on the head pod
kubectl exec -it <head-pod> -- bash

# Simulate memory stress
./stress-ng --vm 4 --vm-bytes 2G --vm-keep

Result:


kubectl describe pod <head-pod>

Name:             raycluster-kuberay-head-ldg9f
Namespace:        default
Priority:         0
Service Account:  default
Node:             k3d-k3s-default-agent-0/192.168.128.4
Start Time:       Tue, 02 Jul 2024 22:16:22 +0800
Labels:           app.kubernetes.io/created-by=kuberay-operator
                  app.kubernetes.io/instance=raycluster
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=kuberay
                  helm.sh/chart=ray-cluster-1.1.1
                  ray.io/cluster=raycluster-kuberay
                  ray.io/group=headgroup
                  ray.io/identifier=raycluster-kuberay-head
                  ray.io/is-ray-node=yes
                  ray.io/node-type=head
Annotations:      ray.io/ft-enabled: false
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: memory. Threshold quantity: 1Gi, available: 991045Ki. 
IP:               10.42.0.2
IPs:
  IP:           10.42.0.2
Controlled By:  RayCluster/raycluster-kuberay
Containers:
  ray-head:
    Container ID:  
    Image:         rayproject/ray:2.9.0
    Image ID:      
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
      -lc
      --
    Args:
      ulimit -n 65536; ray start --head  --dashboard-host=0.0.0.0  --metrics-export-port=8080  --block  --dashboard-agent-listen-port=52365  --num-cpus=1  --memory=2000000000 
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  1
    Limits:
      cpu:     1
      memory:  2G
    Requests:
      cpu:      1
      memory:   2G
    Liveness:   exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success] delay=30s timeout=1s period=5s #success=1 #failure=120
    Readiness:  exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success] delay=10s timeout=1s period=5s #success=1 #failure=10
    Environment:
      RAY_CLUSTER_NAME:                      (v1:metadata.labels['ray.io/cluster'])
      RAY_CLOUD_INSTANCE_ID:                raycluster-kuberay-head-ldg9f (v1:metadata.name)
      RAY_NODE_TYPE_NAME:                    (v1:metadata.labels['ray.io/group'])
      KUBERAY_GEN_RAY_START_CMD:            ray start --head  --dashboard-host=0.0.0.0  --metrics-export-port=8080  --block  --dashboard-agent-listen-port=52365  --num-cpus=1  --memory=2000000000 
      RAY_PORT:                             6379
      RAY_ADDRESS:                          127.0.0.1:6379
      RAY_USAGE_STATS_KUBERAY_IN_USE:       1
      RAY_USAGE_STATS_EXTRA_TAGS:           kuberay_version=v1.1.1;kuberay_crd=RayCluster
      REDIS_PASSWORD:                       
      RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE:  1
    Mounts:
      /dev/shm from shared-mem (rw)
      /tmp/ray from log-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vhmqp (ro)
Conditions:
  Type               Status
  DisruptionTarget   True 
  Initialized        True 
  Ready              False 
  ContainersReady    False 
  PodScheduled       True 
Volumes:
  log-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  shared-mem:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  2G
  kube-api-access-vhmqp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason               Age    From               Message
  ----     ------               ----   ----               -------
  Normal   Scheduled            15m    default-scheduler  Successfully assigned default/raycluster-kuberay-head-ldg9f to k3d-k3s-default-agent-0
  Normal   Pulling              15m    kubelet            Pulling image "rayproject/ray:2.9.0"
  Normal   Pulled               14m    kubelet            Successfully pulled image "rayproject/ray:2.9.0" in 46.191933304s (46.1919395s including waiting)
  Normal   Created              14m    kubelet            Created container ray-head
  Normal   Started              14m    kubelet            Started container ray-head
  Warning  Evicted              9m22s  kubelet            The node was low on resource: memory. Threshold quantity: 1Gi, available: 991045Ki.
  Normal   Killing              9m22s  kubelet            Stopping container ray-head
  Warning  Unhealthy            9m20s  kubelet            Liveness probe failed: command "bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success" timed out
  Warning  ExceededGracePeriod  9m12s  kubelet            Container runtime did not kill the pod within specified grace period.
MortalHappiness commented 12 hours ago

I also tried kubectl drain, but in that case the head pod was created on another node. This may be the API-Initiated Eviction case described in the Stack Overflow post that @DmitriGekhtman mentioned.

andrewsykim commented 12 hours ago

@MortalHappiness what's the restart policy in your head pod? (it wasn't in the describe output)

If it's Always and you validated that the pod state was Failed during disk pressure eviction, then I think this comment is not true and it can be updated.

MortalHappiness commented 12 hours ago

@andrewsykim I used the default helm chart without specifying any values. So the restartPolicy is Always.