kevin85421 opened this issue 1 year ago
In my understanding, it makes sense that the task stops when the driver process dies because the driver process is the owner of the task. The Ray architecture paper does not explicitly mention this, so I am not 100% sure.
@iycheng is this correct? Thanks!
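For anyone landing here: "owner" above refers to Ray's ownership model, in which the worker process that submits a task holds the task's metadata and result. A minimal sketch of the failure mode, assuming a recent Ray 2.x release (the `Owner` actor and the sleeps are purely illustrative; `OwnerDiedError` is Ray's documented exception for this case):

```python
import time
import ray
from ray.exceptions import OwnerDiedError

ray.init()

@ray.remote
def child_task():
    time.sleep(60)
    return 1

@ray.remote
class Owner:
    def submit(self):
        # This actor process becomes the owner of the child task's ObjectRef.
        return child_task.remote()

owner = Owner.remote()
# Nested ObjectRefs are not auto-resolved, so this hands us the inner ref;
# its owner is still the Owner actor, not this driver.
inner_ref = ray.get(owner.submit.remote())

ray.kill(owner)  # simulate the owner process dying
time.sleep(2)    # give the cluster a moment to propagate the failure

try:
    ray.get(inner_ref)
except OwnerDiedError:
    print("Task result lost because its owner died")
```

The same reasoning applies one level up: the driver is the owner of the top-level tasks it submits, so when the head Pod (and the driver on it) dies, those tasks fail as well.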
cc: @rkooo567 @rickyyx to investigate and see whether this is an easy fix for @iycheng
Hi @kevin85421 @iycheng
I wonder if there is any update on this issue? IMO this is more than an observability/UX problem. Job-level FT is not well documented, and it is very confusing for users to figure out how to develop resilient Ray job applications.
I will take a look in the coming sprint! If that's not too late.
Thanks @rickyyx! I just created two new issues related to GCS FT: https://github.com/ray-project/ray/issues/38786 and https://github.com/ray-project/ray/issues/38785. I hope they are related!
Task information does not currently work with GCS HA (basically, all task data is stored in memory). It is actually very tricky to fix, because persisting task information in Redis would add a lot of pressure to the storage.
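As a side note, the in-memory task data in question is what the state API serves. A hedged sketch (assuming Ray ≥ 2.x and an already-running cluster) of how to observe that the listing comes back empty after a GCS restart:

```python
import ray
from ray.util.state import list_tasks

ray.init(address="auto")  # attach to the existing cluster

# Task metadata is served from the GCS's in-memory store, so after a GCS
# restart (even with Redis-backed FT) this listing is expected to be empty.
for task in list_tasks():
    print(task.task_id, task.name, task.state)
```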
What happened + What you expected to happen
When GCS FT is enabled, the Raylets on the worker Pods try to reconnect to the GCS for up to `gcs_rpc_server_reconnect_timeout_s` seconds (60s by default) before shutting down.
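For anyone who needs to tune this: it is a Ray system config, and my understanding is that it can be overridden with the `RAY_`-prefixed environment variable on the Ray processes. A sketch (the value 120 is arbitrary; on KubeRay the variable would go in the Pod spec's `env:` section rather than Python code):

```python
import os

# Must be set in the environment *before* the Ray head/worker processes start;
# setting it in the driver of an already-running cluster has no effect.
os.environ["RAY_gcs_rpc_server_reconnect_timeout_s"] = "120"

import ray

ray.init()  # a locally started cluster inherits the override
```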
I monitored the task's stdout log (e.g. /tmp/ray/session_latest/logs/worker-aaaaa-ffffffff-123.out) on the worker Pod to confirm that the task had stopped.
There are no logs, tasks, or actors shown for the job, but its status is still "RUNNING".
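For completeness, a hedged sketch of how the stale status can be observed through the Jobs API (the dashboard address and job ID below are placeholders):

```python
from ray.job_submission import JobSubmissionClient

# The dashboard serves the Jobs API on port 8265 by default.
client = JobSubmissionClient("http://127.0.0.1:8265")

# Even though the driver is gone and no tasks remain, this still
# reports RUNNING in the scenario described above.
print(client.get_job_status("raysubmit_XXXXXXXXXXXX"))
```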
Versions / Dependencies
Reproduction script
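The original reproduction script was not captured here; below is a minimal sketch of the scenario described above (a KubeRay cluster with GCS FT enabled is assumed, and the task body is purely illustrative):

```python
import time
import ray

ray.init()  # runs on the head Pod as the job's driver

@ray.remote
def long_running_task():
    while True:
        # Visible in the worker's stdout log under
        # /tmp/ray/session_latest/logs/ on the worker Pod.
        print("task is still alive")
        time.sleep(5)

# Kill the head Pod (and thus this driver) while the task runs;
# the task on the worker Pod stops shortly afterwards.
ray.get(long_running_task.remote())
```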
Issue Severity
None