Closed: yncxcw closed this issue 2 years ago
cc @WangTaoTheTonic
I've seen this issue from multiple users. It looks like the heartbeat report is deferred when there is heavy disk activity.
Not really, the very first message we got is:
WARNING worker.py:1153 -- The node with node id a66ae4c1164ebc7ed7f125fbc7fa13e2fb060b09 has been marked dead because the detector has missed too many heartbeats from it.
Adding information from slack.
It's interesting that the heartbeat timeout happens when disk IO is high. Could disk operations impact heartbeat reporting or handling?
@WangTaoTheTonic I am not quite sure, but I've seen at least 3 users who all have high disk IO and the issues he mentioned.
This might be a good scenario to test in our release tests. But haven't you seen this issue before, @WangTaoTheTonic, in your internal repo?
Heartbeat timeouts usually happened in our environment when the GCS had too many tasks to handle, e.g. lots of actor submissions or failovers.
I think in this case the heartbeat is missing from the raylet. In logs from users who reported this issue in the past, I've seen many WARNINGs from raylets saying the heartbeat update was not reported on time.
From our logs I also see lots of delayed heartbeat reporting (those WARNINGs; in our test environment the host machines are heavily oversubscribed).
In this case I'm not sure the raylet really failed to report a heartbeat for 30 seconds. That's a pretty long time, since the raylet itself was not under heavy load. Maybe it's disk IO related?
If heartbeat reporting is blocked by the raylet's own load, we can move it to a dedicated thread. If the raylet cannot get CPU cycles at all, I have no idea what we can do :(
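If the raylet turns out to be alive but just reporting late, another short-term mitigation might be to make the monitor tolerate more missed heartbeats. This is only a sketch based on the Ray 0.8-era internal config keys (num_heartbeats_timeout, raylet_heartbeat_timeout_milliseconds); please double-check the exact parameter name and keys against the version you are running:

import json
import ray

# Assumption: Ray 0.8.x accepts an _internal_config JSON string at init time.
# num_heartbeats_timeout is the number of consecutive missed heartbeats before
# a node is marked dead; raylet_heartbeat_timeout_milliseconds is the report period.
ray.init(
    _internal_config=json.dumps({
        "num_heartbeats_timeout": 120,
        "raylet_heartbeat_timeout_milliseconds": 100,
    })
)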
@yncxcw Can you actually check something: can you try starting your cluster with num_cpus=<true number - 1>?
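For reference, here is a minimal sketch of what that would look like when starting Ray from Python; the 15 is just a placeholder for a 16-core machine, so adjust it to your hardware (the equivalent on the CLI is the --num-cpus flag of ray start):

import ray

# Advertise one fewer CPU than the machine really has so the raylet,
# gcs_server, and OS/disk IO always have a spare core to run on.
ray.init(num_cpus=15)  # e.g. on a 16-core node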
If that doesn't work, can you also provide some more specific diagnostic details?
How many nodes are in your cluster?
Are you running on a cloud provider, or what does your hardware look like?
Do you have special configuration for your containers? (Can you share the output of docker info, or anything else special that you may be doing?)
Can you provide a reproduction or as many details as possible about your workload?
Can you share metrics of resource utilization on any or all nodes (CPU, memory, disk, network, etc.)? See the small collection sketch below.
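For that last point, here is a minimal collection sketch that could run alongside the workload on each node (it assumes psutil is installed; it is not part of Ray):

import time
import psutil

# Print CPU, memory, disk, and network counters once per second.
while True:
    cpu = psutil.cpu_percent()
    mem = psutil.virtual_memory().percent
    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    print(f"cpu={cpu:.0f}% mem={mem:.0f}% "
          f"disk_read={disk.read_bytes / 1e6:.1f}MB disk_write={disk.write_bytes / 1e6:.1f}MB "
          f"net_sent={net.bytes_sent / 1e6:.1f}MB net_recv={net.bytes_recv / 1e6:.1f}MB")
    time.sleep(1)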
Sure, let me try to figure out the answers to these questions.
@yncxcw Actually, can you also try this?
After you start the head / worker nodes, grep your raylet / gcs_server pid and run
# For both worker / head nodes
sudo renice -n -19 [raylet_pid]
# Only for a head node
sudo renice -n -19 [gcs_server pid]
This will give higher OS scheduling priority to raylet and gcs server. I wonder if this will alleviate the issue.
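If it helps, the same thing can be scripted; this is just a Python sketch of the renice commands above (not a Ray API), assuming a Linux box with pgrep available and root privileges:

import os
import subprocess

# Find raylet / gcs_server pids and raise their OS scheduling priority,
# equivalent to `sudo renice -n -19 <pid>` above. Requires root / CAP_SYS_NICE.
for name in ("raylet", "gcs_server"):
    pids = subprocess.run(["pgrep", "-x", name], capture_output=True, text=True).stdout.split()
    for pid in pids:
        os.setpriority(os.PRIO_PROCESS, int(pid), -19)
        print(f"set priority of {name} (pid {pid}) to -19")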
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS):
Ray: 0.8.7
Python: 3.7
TensorFlow: 1.4
OS: Ubuntu-16.04 image on K8s
Context:
We are using Ray for data loading: basically, the Ray actor loads both images and labels off the disk and runs some preprocessing (mostly numpy stuff).
Stack trace:
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
If we cannot run your script, we cannot fix your issue.
It might be hard to reproduce, as this might be an issue coupled with our storage system.
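As a starting point, here is a rough sketch of the data-loading pattern described above, using synthetic files instead of our real storage system; the file sizes, actor count, and preprocessing are placeholders, not the actual workload:

import os
import tempfile

import numpy as np
import ray

ray.init()

# Write some synthetic "image" files so the actors have real disk reads to do;
# bump the file count / array size to generate more disk IO.
DATA_DIR = tempfile.mkdtemp()
for i in range(100):
    np.save(os.path.join(DATA_DIR, f"img_{i}.npy"), np.random.rand(256, 256, 3))

@ray.remote
class Loader:
    def load_batch(self, paths):
        # Read images off disk and run some numpy preprocessing,
        # roughly like our real data-loading actors.
        batch = np.stack([np.load(p) for p in paths]).astype(np.float32)
        return (batch - batch.mean()) / (batch.std() + 1e-8)

paths = sorted(os.path.join(DATA_DIR, f) for f in os.listdir(DATA_DIR))
loaders = [Loader.remote() for _ in range(4)]
batches = ray.get([loader.load_batch.remote(paths[i::4]) for i, loader in enumerate(loaders)])
print([b.shape for b in batches])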