ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Dashboard][Observability] Dashboard shows that a task is still running after a RayCluster with GCS FT restarts #34507

Open kevin85421 opened 1 year ago

kevin85421 commented 1 year ago

What happened + What you expected to happen

  1. Create a RayCluster with GCS fault tolerance.
  2. Run a Python script on the head Pod that submits a task to a worker Pod.
  3. Kill the GCS server process on the head Pod.
  4. The head Pod terminates itself after gcs_rpc_server_reconnect_timeout_s seconds (60s by default).
  5. The worker Pods keep running, and the task keeps running for a while, until the driver process on the old head Pod is killed.
  6. Once the new head Pod becomes ready, connect to the dashboard: it still shows the task as running.

There are no logs, tasks, or actors for the job, but its status is "RUNNING".

[Screenshot (2023-04-17): the dashboard lists the job as RUNNING]

Versions / Dependencies

Reproduction script

# Create a Kubernetes cluster
kind create cluster --image=kindest/node:v1.23.0

# Install a KubeRay operator
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0

# Create a RayCluster with GCS FT
kubectl apply -f raycluster-ft.yaml

# Run test.py in the head Pod
python3 test.py

# Kill the GCS server process in the head Pod
pkill gcs_server

# (1) Wait for the new head to be ready.
# (2) Monitor the task's STDOUT log
# (3) Check the dashboard when the new head is ready.
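The `test.py` referenced above is not included in the issue. A minimal sketch of a script that submits one long-running task and prints heartbeats to STDOUT might look like the following; the function name, heartbeat cadence, and duration are assumptions, not the original script:

```python
# Hypothetical test.py body: a long-running task whose STDOUT can be tailed
# from the head Pod. This is a sketch, not the script from the issue.
import time


def long_running_task_body(n_beats: int = 3600, interval_s: float = 1.0) -> int:
    """Print a heartbeat per iteration so progress is visible in the task log."""
    for i in range(n_beats):
        print(f"heartbeat {i}")
        time.sleep(interval_s)
    return n_beats


# Inside the head Pod this would be submitted to the cluster roughly as:
#   import ray
#   ray.init(address="auto")                 # attach to the running cluster
#   task = ray.remote(long_running_task_body)
#   ray.get(task.remote())
```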

Issue Severity

None

kevin85421 commented 1 year ago

In my understanding, it makes sense that the task stops when the driver process dies because the driver process is the owner of the task. The Ray architecture paper does not explicitly mention this, so I am not 100% sure.
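To illustrate the ownership rule described above, the sketch below simulates a task table where each task's lifetime is tied to the process that submitted it (its owner). The class and method names are illustrative, not Ray APIs:

```python
# Illustrative simulation of owner-based task lifetime; these are NOT Ray APIs.
from dataclasses import dataclass, field


@dataclass
class TaskTable:
    # Maps task_id -> owner_id (the process that submitted the task).
    tasks: dict = field(default_factory=dict)

    def submit(self, task_id: str, owner_id: str) -> None:
        self.tasks[task_id] = owner_id

    def on_owner_death(self, owner_id: str) -> list:
        """When an owner dies, every task it owns is cancelled."""
        dead = [t for t, o in self.tasks.items() if o == owner_id]
        for t in dead:
            del self.tasks[t]
        return dead


table = TaskTable()
table.submit("task-1", owner_id="driver")   # driver on the head Pod owns task-1
cancelled = table.on_owner_death("driver")  # head Pod dies -> driver dies
# task-1 stops running once its owner is gone, matching the behavior in step 5
```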

@iycheng is this correct? Thanks!

scottsun94 commented 1 year ago

cc: @rkooo567 @rickyyx to investigate and see whether this is an easy fix for @iycheng

llidev commented 1 year ago

Hi @kevin85421 @iycheng

I wonder if there is any update on this issue? IMO this is more than an observability/UX problem. Job-level FT is not well documented, and it is very confusing for users trying to develop resilient Ray job applications...

rickyyx commented 1 year ago

I will take a look in the coming sprint! If that's not too late.

llidev commented 1 year ago

Thanks @rickyyx! I just created two new issues related to GCS FT: https://github.com/ray-project/ray/issues/38786 and https://github.com/ray-project/ray/issues/38785. Hope they are related!

rkooo567 commented 1 year ago

Task information does not currently work with GCS HA (essentially all task data is stored in the GCS process's memory). It is actually very tricky to fix, because task information would add a lot of pressure to the storage (Redis) if we started persisting it there.
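A toy illustration of that trade-off: state kept only in the GCS process's memory vanishes when the process restarts, while state written through to an external store (Redis, in Ray's case) survives at the cost of extra writes on every task event. The classes below are illustrative stand-ins, not Ray internals:

```python
# Illustrative only: simulates losing in-memory task state across a GCS restart.
class ExternalStore:
    """Stands in for Redis: outlives GCS process restarts."""
    def __init__(self):
        self.data = {}


class GCS:
    def __init__(self, external_store=None):
        self.in_memory_tasks = {}            # lost whenever the GCS restarts
        self.external_store = external_store

    def record_task(self, task_id, status):
        self.in_memory_tasks[task_id] = status
        if self.external_store is not None:
            # Persisting every task event is what adds write pressure.
            self.external_store.data[task_id] = status

    def task_status(self, task_id):
        if task_id in self.in_memory_tasks:
            return self.in_memory_tasks[task_id]
        if self.external_store is not None:
            return self.external_store.data.get(task_id)
        return None  # state lost: the dashboard has nothing fresh to show


# Memory-only (current behavior): state is gone after a restart.
mem_only = GCS()
mem_only.record_task("task-1", "RUNNING")
mem_only = GCS()                        # GCS restarts
lost = mem_only.task_status("task-1")   # None

# Write-through to an external store: state survives the restart.
redis = ExternalStore()
persisted = GCS(external_store=redis)
persisted.record_task("task-2", "RUNNING")
persisted = GCS(external_store=redis)   # GCS restarts, Redis does not
kept = persisted.task_status("task-2")  # "RUNNING"
```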