ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Timeout for Raylet heartbeat for I/O intensive workloads #11624

Closed: yncxcw closed this issue 2 years ago

yncxcw commented 3 years ago

What is the problem?

Ray version and other system information: Ray 0.8.7, Python 3.7, TensorFlow 1.4, OS: Ubuntu 16.04 image on K8s

Context:

We are using Ray for data loading: the Ray actor loads both images and labels off of disk and runs some preprocessing (mostly numpy operations).
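
Roughly, the loading path looks like the sketch below (simplified; the fake files and sizes are just for illustration, not our real data):

import os
import tempfile
import ray

ray.init()

@ray.remote
class RemoteFetcher:
    # Simplified stand-in for the real data-loading actor.
    def fetch(self, image_path, label_path):
        with open(image_path, "rb") as f:   # read the image bytes off of disk
            image = f.read()
        with open(label_path, "rb") as f:   # read the matching label
            label = f.read()
        # The real code runs numpy-based preprocessing on the image here.
        return image, label

# Tiny fake files so the sketch is runnable end to end.
tmp = tempfile.mkdtemp()
img_path = os.path.join(tmp, "img.bin")
lbl_path = os.path.join(tmp, "lbl.bin")
with open(img_path, "wb") as f:
    f.write(os.urandom(1024))
with open(lbl_path, "wb") as f:
    f.write(b"0")

fetcher = RemoteFetcher.remote()
image, label = ray.get(fetcher.fetch.remote(img_path, lbl_path))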

Stack trace:

(pid=raylet) E1025 09:45:46.570907   837   837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1025 09:45:47.035303     9   993 task_manager.cc:323] Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=modulus.multi_task_loader.dataloader.core.data.ray_iterator, class_name=RemoteFetcher, function_name=fetch, function_hash=}, task_id=6cd1c8101e919ffa55d5f0d50100, job_id=0100, num_args=2, num_returns=2, actor_task_spec={actor_id=55d5f0d50100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=1}
e2emapnet-hanging-fix-6jnny4-0-train-9153b4-c29:9:121 [0] NCCL INFO Destroyed comm 0x7f8d080412a0 rank 0
(pid=raylet) E1025 09:45:48.027905   837   837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) E1025 09:45:48.027952   837   837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) E1025 09:45:48.028023   837   837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) E1025 09:45:48.028045   837   837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) E1025 09:45:48.028066   837   837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) E1025 09:45:48.028097   837   837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) F1025 09:45:48.036741   837   837 node_manager.cc:652]  Check failed: node_id != self_node_id_ Exiting because this node manager has mistakenly been marked dead by the monitor.
(pid=raylet) *** Check failure stack trace: ***
(pid=raylet)     @     0x5614df845a3d  google::LogMessage::Fail()
(pid=raylet)     @     0x5614df846b9c  google::LogMessage::SendToLog()
(pid=raylet)     @     0x5614df845719  google::LogMessage::Flush()
(pid=raylet)     @     0x5614df845931  google::LogMessage::~LogMessage()
(pid=raylet)     @     0x5614df7fc379  ray::RayLog::~RayLog()
(pid=raylet)     @     0x5614df54f334  ray::raylet::NodeManager::NodeRemoved()
(pid=raylet)     @     0x5614df54f4ec  _ZNSt17_Function_handlerIFvRKN3ray8ClientIDERKNS0_3rpc11GcsNodeInfoEEZNS0_6raylet11NodeManager11RegisterGcsEvEUlS3_S7_E0_E9_M_invokeERKSt9_Any_dataS3_S7_
(pid=raylet)     @     0x5614df636390  ray::gcs::ServiceBasedNodeInfoAccessor::HandleNotification()
(pid=raylet)     @     0x5614df636666  _ZNSt17_Function_handlerIFvRKSsS1_EZZN3ray3gcs28ServiceBasedNodeInfoAccessor26AsyncSubscribeToNodeChangeERKSt8functionIFvRKNS3_8ClientIDERKNS3_3rpc11GcsNodeInfoEEERKS6_IFvNS3_6StatusEEEENKUlSM_E0_clESM_EUlS1_S1_E_E9_M_invokeERKSt9_Any_dataS1_S1_
(pid=raylet)     @     0x5614df640d0a  _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray3gcs13CallbackReplyEEEZNS2_9GcsPubSub24ExecuteCommandIfPossibleERKSsRNS6_7ChannelEEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
(pid=raylet)     @     0x5614df6427cb  _ZN5boost4asio6detail18completion_handlerIZN3ray3gcs20RedisCallbackManager12CallbackItem8DispatchERSt10shared_ptrINS4_13CallbackReplyEEEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
(pid=raylet)     @     0x5614dfb292af  boost::asio::detail::scheduler::do_run_one()
(pid=raylet)     @     0x5614dfb2a7b1  boost::asio::detail::scheduler::run()
(pid=raylet)     @     0x5614dfb2b7e2  boost::asio::io_context::run()
(pid=raylet)     @     0x5614df4bbb52  main
(pid=raylet)     @     0x7fe485b9eb97  __libc_start_main
(pid=raylet)     @     0x5614df4cbf91  (unknown)
(pid=25446) F1025 09:45:49.821868 25446 25446 raylet_client.cc:106]  Check failed: _s.ok() [RayletClient] Unable to register worker with raylet.: IOError: Connection reset by peer
(pid=25446) *** Check failure stack trace: ***
(pid=25446)     @     0x7f52ca76b6cd  google::LogMessage::Fail()
(pid=25446)     @     0x7f52ca76c82c  google::LogMessage::SendToLog()
(pid=25446)     @     0x7f52ca76b3a9  google::LogMessage::Flush()
(pid=25446)     @     0x7f52ca76b5c1  google::LogMessage::~LogMessage()
(pid=25446)     @     0x7f52ca722ce9  ray::RayLog::~RayLog()
(pid=25446)     @     0x7f52ca472074  ray::raylet::RayletClient::RayletClient()
(pid=25446)     @     0x7f52ca412f30  ray::CoreWorker::CoreWorker()
(pid=25446)     @     0x7f52ca416e24  ray::CoreWorkerProcess::CreateWorker()
(pid=25446)     @     0x7f52ca417f42  ray::CoreWorkerProcess::CoreWorkerProcess()
(pid=25446)     @     0x7f52ca4188ab  ray::CoreWorkerProcess::Initialize()
(pid=25446)     @     0x7f52ca371a7d  __pyx_pw_3ray_7_raylet_10CoreWorker_1__cinit__()
(pid=25446)     @     0x7f52ca372d05  __pyx_tp_new_3ray_7_raylet_CoreWorker()
(pid=25446)     @           0x551365  (unknown)
(pid=25446)     @           0x5a9cbc  _PyObject_FastCallKeywords
(pid=25446)     @           0x50a5c3  (unknown)
(pid=25446)     @           0x50bfb4  _PyEval_EvalFrameDefault
(pid=25446)     @           0x507d64  (unknown)
(pid=25446)     @           0x509a90  (unknown)
(pid=25446)     @           0x50a48d  (unknown)
(pid=25446)     @           0x50cd96  _PyEval_EvalFrameDefault
(pid=25446)     @           0x507d64  (unknown)
(pid=25446)     @           0x50ae13  PyEval_EvalCode
(pid=25446)     @           0x634c82  (unknown)
(pid=25446)     @           0x634d37  PyRun_FileExFlags
(pid=25446)     @           0x6384ef  PyRun_SimpleFileExFlags
(pid=25446)     @           0x639091  Py_Main
(pid=25446)     @           0x4b0d00  main
(pid=25446)     @     0x7f52cd0d1b97  __libc_start_main
(pid=25446)     @           0x5b250a  _start
(pid=25412) E1025 09:45:49.886169 25412 26225 core_worker.cc:691] Raylet failed.

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

If we cannot run your script, we cannot fix your issue.

This might be hard to reproduce, as the issue may be coupled with our storage system.
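
The closest thing I can offer is a rough sketch of the shape of the workload with fake data (actor name and sizes are made up; I cannot confirm this actually triggers the crash):

import os
import tempfile
import ray

ray.init()

@ray.remote
class IOActor:
    # Hypothetical actor that only generates sustained disk I/O.
    def __init__(self, size_mb=64):
        self.path = os.path.join(tempfile.mkdtemp(), "blob.bin")
        with open(self.path, "wb") as f:
            f.write(os.urandom(size_mb * 1024 * 1024))

    def churn(self, rounds=50):
        # Repeatedly re-read the blob to keep the disk busy.
        total = 0
        for _ in range(rounds):
            with open(self.path, "rb") as f:
                total += len(f.read())
        return total

actors = [IOActor.remote() for _ in range(8)]
print(ray.get([a.churn.remote() for a in actors]))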

ericl commented 3 years ago

cc @WangTaoTheTonic

rkooo567 commented 3 years ago

I've seen this issue from multiple users. It looks like the heartbeat report is delayed when disk activity is high.

WangTaoTheTonic commented 3 years ago

Not really, the very first message we got is:

WARNING worker.py:1153 -- The node with node id a66ae4c1164ebc7ed7f125fbc7fa13e2fb060b09 has been marked dead because the detector has missed too many heartbeats from it.

Adding information from slack.

It's interesting that the heartbeat timeout happens when disk I/O is high. Could disk operations impact heartbeat reporting or handling?

rkooo567 commented 3 years ago

@WangTaoTheTonic I am not quite sure, but I've seen at least 3 users who all have high disk I/O and the issues he mentioned.

rkooo567 commented 3 years ago

This might be a good scenario to test in our release tests. But haven't you seen this issue before, @WangTaoTheTonic, in your internal repo?

WangTaoTheTonic commented 3 years ago

This might be a good scenario to test in our release tests. But haven't you seen this issue before, @WangTaoTheTonic, in your internal repo?

Heartbeat timeouts usually happened in our environment when the GCS had too many tasks to handle, such as lots of actor submissions or failovers.

rkooo567 commented 3 years ago

I think in this case the heartbeat is missing from the raylet. In the logs of users who reported this issue in the past, I've seen many WARNINGs from raylets that the heartbeat update was not reported on time.

WangTaoTheTonic commented 3 years ago

I think in this case the heartbeat is missing from the raylet. In the logs of users who reported this issue in the past, I've seen many WARNINGs from raylets that the heartbeat update was not reported on time.

From our logs I also see lots of delayed heartbeat reporting (those WARNINGs; in our test environment the host machines are heavily oversubscribed).

In this case I'm not sure whether the raylet really failed to report a heartbeat for 30 seconds. That's a pretty long time, since the raylet itself does not carry much heavy load. Maybe it is disk I/O related?

If heartbeat reporting is blocked by the raylet's own load, we can move it to a dedicated thread. If the raylet itself cannot get CPU cycles, I have no idea what we can do :(
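
As a possible stopgap, the tolerance could also be raised through the internal config. A sketch, assuming the num_heartbeats_timeout knob from the 0.8.x-era internal config (please double-check the name against your Ray version):

import json
import ray

# Assumed knob name from the Ray 0.8.x internal config; verify for your version.
# The monitor marks a node dead after this many missed heartbeats (default 30).
ray.init(_internal_config=json.dumps({"num_heartbeats_timeout": 300}))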

rkooo567 commented 3 years ago

@yncxcw

Can you actually check:

  1. Can you measure the CPU usage of all nodes, especially the head node and the worker node that crashed due to the heartbeat issue?

wuisawesome commented 3 years ago

@yncxcw Can you try starting your cluster with num_cpus=<true number - 1>?
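
For example, on an 8-core machine that would be something like the following (adapt to however you launch Ray on K8s):

import ray

# Leave one core free for system processes such as the raylet
# (example values; use your machine's real core count minus one).
ray.init(num_cpus=7)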

If that doesn't work, can you also provide some more specific diagnostic details?

  1. How many nodes are in your cluster?

  2. Are you running on a cloud provider, or what does your hardware look like?

  3. Do you have special configuration for your containers? (Can you share the output of docker info or anything else special that you may be doing?)

  4. Can you provide a reproduction or as much detail as possible about your workload?

  5. Can you share metrics of resource utilization (CPU, memory, disk, network, etc.) on any or all nodes?
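
For item 5, a simple way to sample those numbers on each node is something like the sketch below (assuming psutil is installed; any monitoring tooling you already have works too):

import time
import psutil

# Sample node-level utilization every few seconds; stop with Ctrl-C.
while True:
    cpu = psutil.cpu_percent(interval=None)
    mem = psutil.virtual_memory().percent
    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    print(f"cpu={cpu}% mem={mem}% "
          f"disk_read={disk.read_bytes} disk_write={disk.write_bytes} "
          f"net_sent={net.bytes_sent} net_recv={net.bytes_recv}")
    time.sleep(5)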

yncxcw commented 3 years ago

Sure, let me try to figure out the answers to these questions.

rkooo567 commented 3 years ago

@yncxcw Actually, can you also try this?

After you start the head / worker nodes, find the raylet / gcs_server PIDs (e.g., with pgrep) and run:

# For both worker / head nodes
sudo renice -n -19 [raylet_pid]
# Only for a head node
sudo renice -n -19 [gcs_server_pid]

This will give the raylet and GCS server higher OS scheduling priority. I wonder if this will alleviate the issue.
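
If shelling into the containers is awkward, roughly the same thing can be done from Python (a sketch assuming psutil is installed; it needs root just like renice, and process names may differ by Ray version):

import psutil

# Raise the scheduling priority of the raylet (and gcs_server on the head node).
for proc in psutil.process_iter(["name"]):
    if proc.info["name"] in ("raylet", "gcs_server"):
        proc.nice(-19)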