ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Ray head crashed silently - improve observability for redis timeouts causing said crash #47419

Open zhangsikai123 opened 2 months ago

zhangsikai123 commented 2 months ago

What happened + What you expected to happen

Context:
Severity: High
Case: RayCluster + Ray Data + RayJob to create a distributed inference task
Depends on: python3.10.13, ray2.34.0

Problem description: The Ray head container occasionally exits and restarts when jobs are submitted to the RayCluster using curl. This causes the job submission to fail and also fails all running jobs.

Logs: In general, none. There are no errors or exceptions in either the .out or .err files, except for the following in raylet.out:

The node with node id: 61d503aa6ca8f1753c9dd8c9d93fcb5ff915850197604ec5f7296526 and address: 172.22.3.35 and node name: 172.22.3.35 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload.

But I guess this is just a red herring.

Preliminary investigation: I suspect the root cause is a crash of the GCS server, but there is no failure message in gcs_server.out. The log basically just shows:

[2024-08-30 05:03:56,409 I 27 27] (gcs_server) gcs_actor_manager.cc:1340: Actor created successfully job_id=0e000000 actor_id=42b3aab25a34ba665c6d303f0e000000
[2024-08-30 05:03:56,410 I 27 27] (gcs_server) gcs_actor_manager.cc:357: Finished creating actor. Status: OK job_id=0e000000 actor_id=42b3aab25a34ba665c6d303f0e000000

and then ends. Therefore we cannot confirm this, and we have no idea why it crashes. We have essentially ruled out OOM: when the crash happens there are plenty of available bytes according to the cluster metrics.

Extra:
  1. For FT, we use Redis as the backend for GCS. We suspected the connection to Redis might be the cause, but we cannot verify it because there is no error log.
  2. From our observation, the crash happens fairly reliably when the cluster has been idle for a while (probably half an hour) and then suddenly accepts a job submission.
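
For illustration, jobs are submitted to the cluster over its HTTP endpoint with curl; a minimal sketch of an equivalent submission via the Ray Jobs Python SDK (the address and entrypoint below are placeholders, not our actual setup):

    # Rough equivalent of the curl-based submission, using the Ray Jobs SDK.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient("http://127.0.0.1:8265")  # placeholder dashboard/jobs address
    job_id = client.submit_job(entrypoint="python inference_job.py")  # placeholder entrypoint
    print("submitted job:", job_id)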

Versions / Dependencies

python3.10.13, ray2.34.0, kuberay 1.1.1

Reproduction script

..

Issue Severity

High: It blocks me from completing my task.

anyscalesam commented 2 months ago

Could you spin up a WS in Hosted Ray on Anyscale here and snapshot a reproduction so we can collaborate on debugging further, @zhangsikai123?

zhangsikai123 commented 2 months ago

Could you spin up a WS in Hosted Ray on Anyscale here and snapshot a reproduction so we can collaborate on debugging further, @zhangsikai123?

Sadly, my account does not have access to that feature yet.

scottsun94 commented 2 months ago

@zhangsikai123 can you send me the email you used for Anyscale? I can grant you access to that feature.

Update: access has been granted

jjyao commented 2 months ago

@zhangsikai123, can you disable Redis and try again? We only support Redis FT for RayService: https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html
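
For context, a minimal sketch of how a head node is typically pointed at an external Redis for GCS fault tolerance per the linked docs (the Redis address is a placeholder; with KubeRay the environment variable is normally set on the head Pod instead):

    # Sketch only: start a Ray head node whose GCS uses an external Redis.
    import os
    import subprocess

    env = dict(os.environ, RAY_REDIS_ADDRESS="redis-host:6379")  # placeholder address
    subprocess.run(["ray", "start", "--head"], env=env, check=True)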

zhangsikai123 commented 2 months ago

@zhangsikai123, can you disable Redis and try again? We only support Redis FT for RayService: https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html

Yes, we could try without Redis; however, running the Ray cluster without FT is not workable for us.

richardliaw commented 2 months ago

hmm, I don't think Redis/head node FT actually solves your issue here. It's very likely this is an application error, rather than a transient error. Let's get a reproducible example!

zhangsikai123 commented 1 month ago

More info:

  1. In https://github.com/ray-project/ray/blob/master/python/ray/scripts/scripts.py:

            with cli_logger.indented():
                for process_type, process in unexpected_deceased:
                    cli_logger.error(
                        "{}",
                        cf.bold(str(process_type)),
                        _tags={"exit code": str(process.returncode)},
                    )

            cli_logger.newline()
            cli_logger.error("Remaining processes will be killed.")
            # explicitly kill all processes since atexit handlers
            # will not exit with errors.
            node.kill_all_processes(check_alive=False, allow_graceful=False)
            #### fix ####
            cli_logger.flush()
            #### fix ####
            os._exit(1)

After adding cli_logger.flush() before os._exit(1), I was able to catch the exit code and error message: Some Ray subprocesses exited unexpectedly: gcs_server exit code=-13

  2. To reproduce the error, we created a sandbox Ray cluster and used a script to submit jobs at random intervals (from 5 min to 2 hr), because the issue does not show up when jobs are submitted at a fixed interval.
  3. We suspected the cause is the Redis server we use, which comes from our cloud provider, so we ran a controlled experiment with a Ray cluster using an on-premise Redis server. It turned out that the on-premise Redis server does not cause any problem.
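
For reference, exit code -13 follows Python's subprocess convention: a negative returncode means the child was terminated by that signal number, and on Linux signal 13 is SIGPIPE. A minimal sketch of the mapping:

    import signal

    # In the subprocess module, a negative returncode means the child was
    # terminated by the corresponding signal.
    returncode = -13
    if returncode < 0:
        sig = signal.Signals(-returncode)
        print(f"killed by {sig.name} (signal {sig.value})")  # killed by SIGPIPE (signal 13)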

zhangsikai123 commented 1 month ago

More info: after adding a couple of lines of code to the .py and .cc files and rebuilding Ray, we were able to catch the error code and the following stack trace:

src/ray/gcs/redis_client.cc:82 (PID: 28, TID: 28, errno: 32 (Broken pipe)): Check failed: reply 
*** StackTrace Information ***
/nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x7f0627) [0x55c83bac4627] ray::operator<<()
/nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x7f4a70) [0x55c83bac8a70] ray::RayLog::~RayLog()
/nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x57da38) [0x55c83b851a38] ray::gcs::RedisClient::GetNextJobID()
/nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x46949d) [0x55c83b73d49d] ray::gcs::GcsJobManager::HandleGetNextJobID()
/nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x34cb5c) [0x55c83b620b5c] std::_Function_handler<>::_M_invoke()
/nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x6e653f) [0x55c83b9ba53f] EventTracker::RecordExecution()
/nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x6e1a9f) [0x55c83b9b5a9f] std::_Function_handler<>::_M_invoke()
/nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x6e1f4c) [0x55c83b9b5f4c] boost::asio::detail::completion_handler<>::do_complete()
/nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x807384) [0x55c83badb384] boost::asio::detail::scheduler::do_run_one()
/nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x808815) [0x55c83badc815] boost::asio::detail::scheduler::run()
/nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x80a816) [0x55c83bade816] boost::asio::io_context::run()
/nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x208046) [0x55c83b4dc046] main
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7fd911a0424a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7fd911a04305] __libc_start_main
/nfs_beijing/model_server_prod/envs/xtrimo/xtrimo_py310/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server(+0x267f2e) [0x55c83b53bf2e] _start

We are wondering if this is related to this hiredis issue: https://github.com/redis/hiredis/issues/910
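
The "Broken pipe" (errno 32) in the check failure matches the classic SIGPIPE failure mode: writing to a connection whose peer has already closed it. A small self-contained sketch of that behavior on Linux (not Ray or hiredis code; Python normally ignores SIGPIPE, so the default disposition is restored here to mimic what a C program sees):

    import os
    import signal

    # Restore the default SIGPIPE disposition (terminate the process), which is
    # what an unprepared C program such as gcs_server gets.
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)

    read_fd, write_fd = os.pipe()
    os.close(read_fd)          # the other end goes away, like Redis dropping an idle connection
    os.write(write_fd, b"x")   # delivers SIGPIPE; the process is terminated by signal 13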

rynewang commented 1 month ago

This check failure happens when the GCS can't connect to Redis: after RAY_redis_db_connect_retries = 120 retries, the check fails. Can you also look at your Redis server and its logs to see whether anything is going wrong there?

zhangsikai123 commented 1 month ago

Thanks for the help from you guys, we have finally identified the cause: our Redis server has a 3600s timeout config, which means that if a connection is idle for over an hour the server closes it and the client receives a SIGPIPE signal, which terminates the process in most cases if not handled properly. After setting the timeout to 0, the crashes stopped.
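
For anyone hitting the same thing, a minimal sketch of checking and clearing that server-side idle timeout with redis-py (the host and port are placeholders, and changing server config may not be allowed on a managed cloud Redis):

    import redis

    r = redis.Redis(host="localhost", port=6379)  # placeholder address
    print(r.config_get("timeout"))                # e.g. {'timeout': '3600'}
    r.config_set("timeout", 0)                    # 0 disables the idle-connection timeout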

zhangsikai123 commented 1 month ago

Still, we want to raise some observability issues:

  1. Ray's script code https://github.com/ray-project/ray/blob/master/python/ray/scripts/scripts.py fails to output the processes' exit codes because the logger does not flush the error message before exiting.
  2. The GCS server does not handle the SIGPIPE signal; I had to add a signal handler and rebuild the GCS server to print out the stack trace and understand the error (see the sketch below): https://github.com/ray-project/ray/blob/master/src/ray/gcs/gcs_server/gcs_server_main.cc#L138

Hope this experience can be fruitful to our community.
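
A rough Python illustration of the kind of handler meant in point 2 (the actual fix would be C++ in gcs_server_main.cc; this sketch only shows the general pattern of making SIGPIPE visible instead of letting the process die silently):

    import signal
    import sys
    import traceback

    def on_sigpipe(signum, frame):
        # Log the signal and the current stack so the failure point shows up in the logs.
        print(f"caught signal {signum} (SIGPIPE): peer closed the connection",
              file=sys.stderr, flush=True)
        traceback.print_stack(frame, file=sys.stderr)

    signal.signal(signal.SIGPIPE, on_sigpipe)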

rynewang commented 1 month ago

This is interesting! We will think about how to improve the messaging even further. Thanks for the debugging!

anyscalesam commented 1 month ago

@rynewang reopening just so we can track this as an enhancement with the full context. Ditto - thank you @zhangsikai123 for going deep, resolving this, and sharing the details here.

On that point, would you be interested in contributing here, @zhangsikai123?

zhangsikai123 commented 1 month ago

@rynewang reopening just so we can track this as an enhancement with the full context. Ditto - thank you @zhangsikai123 for going deep, resolving this, and sharing the details here.

On that point, would you be interested in contributing here, @zhangsikai123?

We could submit a PR for review that catches and reports the signal and also properly handles the Redis disconnection.

jjyao commented 2 weeks ago

We could submit a PR for review that catches and reports the signal and also properly handles the Redis disconnection.

@zhangsikai123 You are welcome to submit a PR and we can help review it.

zhangsikai123 commented 2 weeks ago

We could submit a PR for review that catches and reports the signal and also properly handles the Redis disconnection.

@zhangsikai123 You are welcome to submit a PR and we can help review it.

Two things:

  1. I just checked the latest version; it seems the observability issue has been solved, and my GCS server is able to log an error message once the Redis connection has been idle long enough to time out.
  2. The GCS server should be able to handle the idle Redis connection timeout, but refactoring that part of the code has wide-reaching effects, so I think it might be better to leave it to the core team to decide whether it's worth contributing.