ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray Core] Ray agent getting killed unexpectedly #29412

Open pratkpranav opened 1 year ago

pratkpranav commented 1 year ago

What happened + What you expected to happen

We are writing a module for data-parallel training using Ray for our machine learning engine. Currently, we are trying to scale our computation to 64 nodes on an in-house cluster, but during communication between the nodes the Ray agent on some nodes fails unexpectedly (it doesn't happen right after the start; all the nodes run for a while, and then some nodes start failing). I am pretty sure the program is not running out of memory (I don't see any OOM killer entry in dmesg). The program keeps running if a node that is not running a worker fails, but it terminates as soon as a node with a worker running on it fails. We are using the Ray Collective Communication Lib (Gloo through pygloo), but we see the same failures even when using Ray Core for communication.

Error Log(Click to expand) ``` 2022-10-17 14:23:59,720 WARNING worker.py:1839 -- Raylet is terminated: ip=172.29.58.176, id=c28cee44b5cd67e730b3a9f729ca772f9bfd3f4b936ae1999d38cf36. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs: [state-dump] NodeManagerService.grpc_server.GetNodeStats - 475 total (0 active), CPU time: mean = 1.077 ms, total = 511.665 ms [state-dump] NodeManager.deadline_timer.record_metrics - 432 total (1 active), CPU time: mean = 546.906 us, total = 236.263 ms [state-dump] NodeManager.deadline_timer.debug_state_dump - 108 total (1 active), CPU time: mean = 1.300 ms, total = 140.400 ms [state-dump] NodeResourceInfoGcsService.grpc_client.GetResources - 87 total (0 active), CPU time: mean = 13.138 us, total = 1.143 ms [state-dump] NodeManager.deadline_timer.print_event_loop_stats - 18 total (1 active, 1 running), CPU time: mean = 2.067 ms, total = 37.203 ms [state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 15 total (1 active), CPU time: mean = 225.370 us, total = 3.381 ms [state-dump] Subscriber.HandlePublishedMessage_GCS_NODE_INFO_CHANNEL - 11 total (0 active), CPU time: mean = 124.656 us, total = 1.371 ms [state-dump] PeriodicalRunner.RunFnPeriodically - 8 total (0 active), CPU time: mean = 357.759 us, total = 2.862 ms [state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 88.669 us, total = 177.338 us [state-dump] AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 341.472 us, total = 341.472 us [state-dump] Subscriber.HandlePublishedMessage_GCS_WORKER_DELTA_CHANNEL - 1 total (0 active), CPU time: mean = 3.499 us, total = 3.499 us [state-dump] RuntimeEnvService.grpc_client.GetOrCreateRuntimeEnv - 1 total (0 active), CPU time: mean = 52.743 us, total = 52.743 us [state-dump] Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 1 total (0 active), CPU time: mean = 893.418 us, total = 893.418 us [state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 13.287 ms, total = 13.287 ms [state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 345.265 us, total = 345.265 us [state-dump] NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 2.833 ms, total = 2.833 ms [state-dump] JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 6.246 us, total = 6.246 us [state-dump] DebugString() time ms: 1 [state-dump] [state-dump] (raylet, ip=172.29.58.176) [2022-10-17 14:24:00,156 E 692949 692988] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent w as unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. 2022-10-17 15:47:23,758 WARNING worker.py:1839 -- The node with node id: c28cee44b5cd67e730b3a9f729ca772f9bfd3f4b936ae1999d38cf36 and address: 172.29.58.176 and node name: 172.29.58.176 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload. (scheduler +1h42m22s) Tip: use `ray status` to view detailed cluster status. 
To disable these messages, set RAY_SCHEDULER_EVENTS=0. (scheduler +1h42m22s) Restarting 1 nodes of type local.cluster.node (lost contact with raylet). (raylet, ip=172.29.58.176) [2022-10-17 14:24:00,156 E 692949 692988] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent w as unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. 2022-10-17 15:57:44,909 WARNING worker.py:1839 -- Raylet is terminated: ip=172.29.58.107, id=d495158e712947f2e6fb8b3fc4a1ddad79adac63589745919c5083ab. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs: [state-dump] RuntimeEnvService.grpc_client.GetOrCreateRuntimeEnv - 2 total (0 active), CPU time: mean = 20.050 ms, total = 40.101 ms [state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 85.026 us, total = 170.052 us [state-dump] ObjectManager.ObjectAdded - 2 total (0 active), CPU time: mean = 233.389 us, total = 466.779 us [state-dump] NodeManagerService.grpc_server.RequestWorkerLease - 1 total (0 active), CPU time: mean = 1.180 ms, total= 1.180 ms [state-dump] Subscriber.HandlePublishedMessage_GCS_WORKER_DELTA_CHANNEL - 1 total (0 active), CPU time: mean = 5.108 us, total = 5.108 us [state-dump] ObjectManager.HandlePull - 1 total (0 active), CPU time: mean = 1.457 ms, total = 1.457 ms [state-dump] NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 103.737 us, total = 103.737 us [state-dump] JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 4.774 us, total = 4.774 us [state-dump] ObjectManagerService.grpc_client.Pull - 1 total (0 active), CPU time: mean = 21.685 us, total = 21.685 us [state-dump] Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 1 total (0 active), CPU time: mean = 486.567 us, total = 486.567 us [state-dump] NodeManagerService.grpc_server.GetSystemConfig - 1 total (0 active), CPU time: mean = 187.941 us, total = 187.941 us [state-dump] ClientConnection.async_write.DoAsyncWrites - 1 total (0 active), CPU time: mean = 39.085 us, total = 39.085 us [state-dump] AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 287.678 us, total = 287.678 us [state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 11.784 ms, total = 11.784 ms [state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch - 1 total (0 active), CPU time: mean = 8.580 us, total = 8.580 us [state-dump] ObjectManager.ObjectDeleted - 1 total (0 active), CPU time: mean = 22.651 us, total = 22.651 us [state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 300.936 us, total = 300.936 us [state-dump] DebugString() time ms: 1 [state-dump] [state-dump] (raylet, ip=172.29.58.107) [2022-10-17 15:57:45,682 E 286759 286803] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent w as unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. 
Traceback (most recent call last): File “mlm_training.py”, line 35, in wrapped_model.train() File “/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/distributed.py”, line 204, in train train_state_manager.train_batch(epoch, batch_id) File “/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/train_state_manager.py”, line 92, in train_batch self._compute_and_store_batch_gradients(batch_id) File “/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/train_state_manager.py”, line 105, in _compute_and_store_batch_gradients ray.get( File “/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py”, line 105, in wrapper return func(*args, **kwargs) File “/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py”, line 2291, in get raise value ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. ```

Log Files from Head Node: logs_head.zip Log Files from Node 107: logs_107.zip Log Files from Node 176: logs_176.zip

Could it be that we are hitting the object store limits measured in the object store benchmark from ray-benchmarks?

Ray Discuss Link: https://discuss.ray.io/t/ray-actor-dying-unexpectedly/7797/6

Versions / Dependencies

Ray version: daily release (Ray collective communication for pygloo only works after https://github.com/ray-project/ray/issues/29036)
OS: Ubuntu
Python: 3.8.10

Cluster Info:
Number of training nodes: 64
vCPUs per node: 4
RAM per node: 32GB

Reproduction script

This code doesn't exactly reproduce the error, but it fails in almost the same manner as the issue described above, and it runs very similarly to the main script.

Code to Reproduce Error

```
import os
import ray
import numpy as np
import ray.util.collective as col
from ray.util.collective.types import Backend, ReduceOp


@ray.remote(num_cpus=4, max_restarts=-1)
class communicating_actor:
    def __init__(self, rank, world_size, group_name, init_data):
        self.init_data = init_data
        col.init_collective_group(world_size, rank, Backend.GLOO, group_name)

    def test_allreduce(self, group_name):
        '''
        rank            # Rank of this process within list of participating processes
        world_size      # Number of participating processes
        fileStore_path  # The path to create filestore
        '''
        self.sendbuf = np.ones((4096, 1024, 256), dtype=np.float32)
        col.allreduce(self.sendbuf, group_name, ReduceOp.SUM)


if __name__ == "__main__":
    ray.init(address='auto')
    world_size = 64
    init_data = np.ones((4096, 1024, 256), dtype=np.float32)
    ref = ray.put(init_data)
    communicating_actors = [
        communicating_actor.remote(rank, world_size, "default", ref)
        for rank in range(world_size)
    ]
    for i in range(1000):
        ray.get([actor.test_allreduce.remote("default") for actor in communicating_actors])
```
Cluster Configuration File

```
auth:
  ssh_user: root
cluster_name: default
cluster_synced_files: []
file_mounts: {}
file_mounts_sync_continuously: false
head_setup_commands: []
head_start_ray_commands:
- ray stop
- ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --system-config='{"num_heartbeats_timeout":5000,"worker_register_timeout_seconds":500}'
idle_timeout_minutes: 5
initialization_commands: []
max_workers: 87
min_workers: 87
provider:
  head_ip: 172.29.58.24
  type: local
  worker_ips:
  - 172.29.58.102
  - 172.29.58.103
  - 172.29.58.104
  - 172.29.58.105
  - 172.29.58.106
  - 172.29.58.107
  - 172.29.58.108
  - 172.29.58.109
  - 172.29.58.110
  - 172.29.58.111
  - 172.29.58.112
  - 172.29.58.113
  - 172.29.58.114
  - 172.29.58.115
  - 172.29.58.116
  - 172.29.58.117
  - 172.29.58.118
  - 172.29.58.119
  - 172.29.58.120
  - 172.29.58.121
  - 172.29.58.122
  - 172.29.58.123
  - 172.29.58.124
  - 172.29.58.125
  - 172.29.58.126
  - 172.29.58.127
  - 172.29.58.128
  - 172.29.58.129
  - 172.29.58.130
  - 172.29.58.131
  - 172.29.58.132
  - 172.29.58.133
  - 172.29.58.134
  - 172.29.58.135
  - 172.29.58.136
  - 172.29.58.137
  - 172.29.58.138
  - 172.29.58.139
  - 172.29.58.140
  - 172.29.58.141
  - 172.29.58.142
  - 172.29.58.143
  - 172.29.58.144
  - 172.29.58.145
  - 172.29.58.146
  - 172.29.58.147
  - 172.29.58.148
  - 172.29.58.149
  - 172.29.58.150
  - 172.29.58.151
  - 172.29.58.152
  - 172.29.58.153
  - 172.29.58.154
  - 172.29.58.155
  - 172.29.58.156
  - 172.29.58.157
  - 172.29.58.158
  - 172.29.58.159
  - 172.29.58.160
  - 172.29.58.161
  - 172.29.58.162
  - 172.29.58.163
  - 172.29.58.164
  - 172.29.58.165
  - 172.29.58.166
  - 172.29.58.167
  - 172.29.58.168
  - 172.29.58.169
  - 172.29.58.170
  - 172.29.58.171
  - 172.29.58.172
  - 172.29.58.173
  - 172.29.58.174
  - 172.29.58.175
  - 172.29.58.176
  - 172.29.58.177
  - 172.29.58.178
  - 172.29.58.179
  - 172.29.58.180
  - 172.29.58.181
  - 172.29.58.182
  - 172.29.58.183
  - 172.29.58.184
  - 172.29.58.185
  - 172.29.58.186
  - 172.29.58.187
  - 172.29.58.188
rsync_exclude:
- '**/.git'
- '**/.git/**'
rsync_filter:
- .gitignore
setup_commands: []
upscaling_speed: 1.0
worker_setup_commands: []
worker_start_ray_commands:
- ray stop
- ray start --address=$RAY_HEAD_IP:6379
```

Issue Severity

High: It blocks me from completing my task.

rkooo567 commented 1 year ago

Your zip file seems to be empty when I download it.

What's the log from dashboard_agent.log? We need to understand why this component failed!

pratkpranav commented 1 year ago

My Mac messed it up earlier, I guess. Could you please check it now?

rkooo567 commented 1 year ago

Would you be willing to help us debug after we merge https://github.com/ray-project/ray/pull/29540? You can use the nightly wheel and give us another log when things fail.

rkooo567 commented 1 year ago

We will also try reproducing it ourselves, but from the repro script it seems to be more of an env-specific issue than a Ray issue (for example, the agent detects whether its parent is dead via psutil.parent, but based on your logs this sometimes does not work in your env; the raylet will kill itself if the agent is killed).

pratkpranav commented 1 year ago

Sure

rkooo567 commented 1 year ago

I will ping you after the PR is merged!

rkooo567 commented 1 year ago

Okay, I merged the PR. The wheel will be available in 2~3 hours. You can get the download link from https://docs.ray.io/en/master/ray-overview/installation.html#daily-releases-nightlies. Please make sure the commit is after 98da294f0b8fbf0ab9d94ebd552c4d78b4d3aaf8.

You can try 2 things.

  1. Try reproducing it with the nightly wheel and give me the log of dashboard_agent.log.
  2. Try doing the same thing with the env var RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=2. You can start Ray like this to apply it: RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=2 ray start --head and RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=2 ray start --address=<head_ip> (a sketch for checking that the variable is actually applied on every node follows below).
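If it helps, one quick way to confirm the env var actually reached every node is to schedule cheap tasks across the cluster and read the variable back from each worker. This is only a sanity-check sketch, not an official Ray utility, and it assumes workers inherit the environment of the `ray start` command on their node (which is normally the case):

```
# Sanity-check sketch (not an official Ray utility): read the env var back
# from worker processes across the cluster to see whether it was applied.
import os
import socket
import time

import ray

ray.init(address="auto")


@ray.remote(num_cpus=1)
def report_env():
    # Sleep briefly so concurrent tasks spread across nodes (best effort).
    time.sleep(1)
    return socket.gethostname(), os.environ.get(
        "RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S"
    )


# 64 nodes x 4 vCPUs = 256 CPUs, so 256 concurrent 1-CPU tasks should touch
# every node while they all sleep.
results = ray.get([report_env.remote() for _ in range(256)])
per_node = {}
for host, value in results:
    per_node.setdefault(host, value)
for host in sorted(per_node):
    print(host, "->", per_node[host])
```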
rkooo567 commented 1 year ago

cc @pratkpranav lmk how this goes! I will downgrade the priority until you get back to me

pratkpranav commented 1 year ago

Sure! It is running; I will try to get back to you with the required logs by today.

pratkpranav commented 1 year ago

@rkooo567 the program has been running for a while now without any crashes. I see some nodes warning of lagging heartbeats (outputs added). I have observed that the other warning (which causes the actor to kill itself) tends to appear only after the program has been running for some time (maybe a day or two). I will keep it running over the weekend and get back to you when I hit the issue again.

Output Logs ``` (raylet, ip=172.29.58.101) [2022-10-27 16:30:03,850 E 98551 98592] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. 2022-10-27 17:53:40,197 WARNING worker.py:1839 -- The node with node id: b1d7d50a82a27c9cdb73c6c5a27cf355243895e54636ae698b31d1bf and address: 172.29.58.101 and node name: 172.29.58.101 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload. (scheduler +5h5m58s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0. (scheduler +5h5m58s) Restarting 1 nodes of type local.cluster.node (lost contact with raylet). (raylet, ip=172.29.58.101) [2022-10-27 16:30:03,850 E 98551 98592] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. (raylet, ip=172.29.58.106) [2022-10-27 19:30:01,262 E 74990 75037] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. 2022-10-27 20:53:35,494 WARNING worker.py:1839 -- The node with node id: c2ea103cc18b283a45c6ce53f4cc49110abf9ab459b71b3633c5b0ed and address: 172.29.58.106 and node name: 172.29.58.106 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload. (scheduler +8h5m53s) Restarting 1 nodes of type local.cluster.node (lost contact with raylet). (raylet, ip=172.29.58.106) [2022-10-27 19:30:01,262 E 74990 75037] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. (raylet, ip=172.29.58.156) [2022-10-27 22:04:42,140 E 71180 71226] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. 2022-10-27 23:28:14,465 WARNING worker.py:1839 -- The node with node id: d923bc381d8b6ee1fc0e64ea12ee62841c64e84184ae64c68f43cb4e and address: 172.29.58.156 and node name: 172.29.58.156 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload. (scheduler +10h40m36s) Restarting 1 nodes of type local.cluster.node (lost contact with raylet). (raylet, ip=172.29.58.156) [2022-10-27 22:04:42,140 E 71180 71226] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. 
See `dashboard_agent.log` for the root cause. (raylet, ip=172.29.58.101) [2022-10-27 23:31:02,301 E 99825 99866] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. (raylet, ip=172.29.58.118) [2022-10-28 00:05:22,500 E 75647 75693] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. 2022-10-28 00:54:40,283 WARNING worker.py:1839 -- The node with node id: 49ad90bc62a60013e651e68b91cc20c1c58e47662dc193e679b8b995 and address: 172.29.58.101 and node name: 172.29.58.101 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload. (scheduler +12h7m0s) Restarting 1 nodes of type local.cluster.node (lost contact with raylet). (raylet, ip=172.29.58.101) [2022-10-27 16:30:03,850 E 98551 98592] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. (raylet, ip=172.29.58.101) [2022-10-27 23:31:02,301 E 99825 99866] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. ```
rkooo567 commented 1 year ago
2022-10-27 17:53:40,197 WARNING worker.py:1839 -- The node with node id: b1d7d50a82a27c9cdb73c6c5a27cf355243895e54636ae698b31d1bf and address: 172.29.58.101 and node name: 172.29.58.101 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a    (1) raylet crashes unexpectedly (OOM, preempted node, etc.) 
        (2) raylet has lagging heartbeats due to slow network or busy workload.

So this happens because of the agent failure. This is the flow:

agent failed -> raylet failed -> heartbeat not reported -> heartbeat lagging error.

I think I can remove this false-negative error message. The original issue remains the same (the agent failed because it thinks the raylet has failed, but the raylet is actually alive).

Btw, I added a kind of "speculative fix" to my PR. So if you cannot reproduce the issue, it probably means the fix is working. Lmk how it goes! Thanks again for all the information :)

pratkpranav commented 1 year ago

@rkooo567, it failed again, but this time without producing any warning. The program output is added below.

Error Output ``` (raylet, ip=172.29.58.110) [2022-10-28 10:07:32,236 E 81604 81646] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. Traceback (most recent call last): File "mlm_training.py", line 35, in wrapped_model.train() File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/distributed.py", line 204, in train train_state_manager.train_batch(epoch, batch_id) File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/train_state_manager.py", line 92, in train_batch self._compute_and_store_batch_gradients(batch_id) File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/train_state_manager.py", line 105, in _compute_and_store_batch_gradients ray.get( File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2295, in get raise value ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. ```

These are the logs from node 110 (which failed) and the head node: logs_head.zip logs_110.zip

In the meantime, I will run the program again with the updated env var RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=2.

Also, regarding the last output log I added here, #29412 (comment): the program kept running while the raylet failed only on nodes that were not running any actor. As soon as the raylet failed on a node on which an actor was running, the program failed soon after, with logs similar to those attached above.

I also see this warning

2022-10-27 17:53:40,197 WARNING worker.py:1839 -- The node with node id: b1d7d50a82a27c9cdb73c6c5a27cf355243895e54636ae698b31d1bf and address: 172.29.58.101 and node name: 172.29.58.101 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a    (1) raylet crashes unexpectedly (OOM, preempted node, etc.) 
        (2) raylet has lagging heartbeats due to slow network or busy workload.

produced by a node only a while after the raylet failed (from the output https://github.com/ray-project/ray/issues/29412#issuecomment-1294469675). Is this expected?

rkooo567 commented 1 year ago

produced by a node only a while after the raylet failed(from the output https://github.com/ray-project/ray/issues/29412#issuecomment-1294469675). Is this expected?

Yes. Let me give you a bit more detail about what is happening here. On every Ray node there is a raylet, and the raylet spawns a child process called the "dashboard agent". Although the name says "dashboard", it actually does more than dashboard work, so you can just think of it as an "agent".

When the agent is dead, the raylet is dead. They fate share.

Also, in Ray, every raylet sends heartbeats to the central control plane (GCS). If a raylet's heartbeat is not delivered to the GCS within 30 seconds, this warning is printed:

2022-10-27 17:53:40,197 WARNING worker.py:1839 -- The node with node id: b1d7d50a82a27c9cdb73c6c5a27cf355243895e54636ae698b31d1bf and address: 172.29.58.101 and node name: 172.29.58.101 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a    (1) raylet crashes unexpectedly (OOM, preempted node, etc.) 
        (2) raylet has lagging heartbeats due to slow network or busy workload.

Your workloads consistently fail in this order:

agent thinks the raylet is dead when it is not -> agent kills itself because it thinks the raylet is dead -> raylet kills itself because the agent is dead. In this case, you are expected to see the above warning because the raylet dies ungracefully, and the heartbeat won't be sent to the GCS.

And this is the code where the agent detects the parent's status (it keeps checking the parent's pid): https://github.com/ray-project/ray/blob/d1662d68b96803c629ff982268563a14688646d3/dashboard/agent.py#L174. Apparently this detection code doesn't work in your environment for some unknown reason, and I am trying to figure out why.
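Roughly, that check has the following shape. This is only an illustrative sketch, not the actual agent.py code; the real `_check_parent` loop has more branches (for example, handling the case where the agent has been reparented to PID 1):

```
# Illustrative sketch of a psutil-based parent liveness check, similar in
# spirit to the agent's _check_parent loop. Not Ray's actual code.
import asyncio
import sys

import psutil


async def check_parent_loop(raylet_pid: int, interval_s: float = 0.4):
    curr_proc = psutil.Process()
    while True:
        parent = curr_proc.parent()
        # If psutil can no longer resolve the parent, or we have been
        # reparented (e.g. to init/PID 1, or to some other process),
        # conclude that the raylet is gone and exit.
        if parent is None or parent.pid == 1 or parent.pid != raylet_pid:
            sys.exit(1)
        await asyncio.sleep(interval_s)
```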

It looks like the agent died because of a different exception in your latest run. Unfortunately, the exception is not logged by our code...

2022-10-28 10:07:30,636 ERROR agent.py:260 -- Failed to check parent PID, exiting.

I think we may need a couple more iterations of "merge code with more debugging info -> you run it and tell me how it goes", since it seems hard to reproduce in my env... Would this be okay? I will create another PR to add more information in this case.

pratkpranav commented 1 year ago

Sure, thanks for the help.

rkooo567 commented 1 year ago

I will let you know after this PR is merged: https://github.com/ray-project/ray/pull/29802

rkooo567 commented 1 year ago

Okay, it is merged: https://github.com/ray-project/ray/pull/29802

Maybe you can try this again in 3~4 hours!

pratkpranav commented 1 year ago

@rkooo567, I have added the logs for the run after the last PR (https://github.com/ray-project/ray/pull/29802) was merged.

Now, the traceback is visible.

2022-10-31 02:35:10,558 ERROR agent.py:260 -- Failed to check parent PID, exiting.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py", line 185, in _check_parent
    init_assigned_for_parent = parent.pid == 1
AttributeError: 'NoneType' object has no attribute 'pid'
Error Output ``` (raylet, ip=172.29.58.150) [2022-10-31 02:35:12,143 E 88707 88753] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. Traceback (most recent call last): File "mlm_training.py", line 35, in wrapped_model.train() File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/distributed.py", line 204, in train train_state_manager.train_batch(epoch, batch_id) File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/train_state_manager.py", line 92, in train_batch self._compute_and_store_batch_gradients(batch_id) File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/train_state_manager.py", line 105, in _compute_and_store_batch_gradients ray.get( File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2295, in get raise value ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. ```

These are the logs from node 150 (which failed) and the head node: logs_head.zip logs_150.zip

I also ran the program with the updated env var RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=2; the error was the same.

rkooo567 commented 1 year ago

Ah, this was my bad... this one seems to be a bug in the new code. Let me fix it real quick. I will get back to you soon... Thanks for trying this again!

rkooo567 commented 1 year ago

The fix is in the ^ PR... If your workload still fails after this PR, we may need to change the mechanism used to detect the parent failure (right now we rely on psutil).

rkooo567 commented 1 year ago

@pratkpranav Can you try this one more time with the latest master? I think this will be the last time before I decide how to fix the issues. Thanks again for your help!!

rkooo567 commented 1 year ago

There are 2 potential solutions here:

  1. If the nightly fixes your issues with RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S, maybe we will go with this direction.
  2. If not, we may change the agent -> raylet health check mechanism, because the current psutil-based check seems unreliable (a rough sketch of one alternative follows below).
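To illustrate option 2: one alternative that does not depend on psutil at all is a ping-style check, where the agent periodically probes the raylet over an existing channel and only exits after several consecutive failures. A rough, non-Ray sketch of that idea, where `probe_raylet` is a placeholder rather than a real Ray API:

```
# Rough sketch of a ping-based liveness check (option 2 above): instead of
# asking psutil for the parent process, the child periodically calls a cheap
# health probe and only gives up after N consecutive failures.
# `probe_raylet` is a placeholder, not a real Ray API.
import sys
import time


def watch_parent(probe_raylet, interval_s: float = 2.0, max_failures: int = 5):
    failures = 0
    while True:
        try:
            probe_raylet()  # e.g. a lightweight RPC health call
            failures = 0
        except Exception:
            failures += 1
            if failures >= max_failures:
                # Several probes in a row failed: treat the raylet as dead.
                sys.exit(1)
        time.sleep(interval_s)
```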
pratkpranav commented 1 year ago

Thanks for the help. The program is running; I will add a follow-up here when it fails or succeeds.

pratkpranav commented 1 year ago

@rkooo567, It failed again.

Error Output ``` 2022-11-02 02:42:14,808 WARNING worker.py:1849 -- Raylet is terminated: ip=172.29.58.133, id=6ad361c6d67a3180a1a9ce9e00b1b6f371aab0dd0bd32052ebbaac63. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs: [state-dump] NodeManager.deadline_timer.debug_state_dump - 306 total (1 active), CPU time: mean = 1.470 ms, total = 449.737 ms [state-dump] NodeManager.deadline_timer.record_metrics - 204 total (1 active), CPU time: mean = 609.296 us, total = 124.296 ms [state-dump] NodeResourceInfoGcsService.grpc_client.GetResources - 74 total (0 active), CPU time: mean = 39.516 us, total = 2.924 ms [state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 73 total (1 active), CPU time: mean = 209.898 us, total = 15.323 ms [state-dump] Subscriber.HandlePublishedMessage_GCS_WORKER_DELTA_CHANNEL - 67 total (0 active), CPU time: mean = 1.286 us, total = 86.180 us [state-dump] Subscriber.HandlePublishedMessage_GCS_NODE_INFO_CHANNEL - 57 total (0 active), CPU time: mean = 132.101 us, total = 7.530 ms [state-dump] NodeManager.deadline_timer.print_event_loop_stats - 51 total (1 active, 1 running), CPU time: mean = 1.973 ms, total = 100.613 ms [state-dump] PeriodicalRunner.RunFnPeriodically - 8 total (0 active), CPU time: mean = 328.886 us, total = 2.631 ms [state-dump] Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 3 total (0 active), CPU time: mean = 659.577 us, total = 1.979 ms [state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 128.344 us, total = 256.687 us [state-dump] RuntimeEnvService.grpc_client.GetOrCreateRuntimeEnv - 2 total (0 active), CPU time: mean = 86.844 us, total = 173.688 us [state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 13.598 ms, total = 13.598 ms [state-dump] NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 821.980 us, total = 821.980 us [state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 364.980 us, total = 364.980 us [state-dump] RuntimeEnvService.grpc_client.DeleteRuntimeEnvIfPossible - 1 total (0 active), CPU time: mean = 18.151 us, total = 18.151 us [state-dump] JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 14.571 us, total = 14.571 us [state-dump] AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 315.289 us, total = 315.289 us [state-dump] DebugString() time ms: 0 [state-dump] [state-dump] (raylet, ip=172.29.58.133) [2022-11-02 02:42:16,100 E 10907 10953] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. 2022-11-02 06:00:25,665 WARNING worker.py:1849 -- Raylet is terminated: ip=172.29.58.162, id=c598de8f58ccaa7660cedcf4f3cfa4e36f1e5ad749bbc2d489f4edcd. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. 
Last 20 lines of the Raylet logs: [state-dump] ObjectManager.ObjectAdded - 4 total (0 active), CPU time: mean = 203.535 us, total = 814.138 us [state-dump] RuntimeEnvService.grpc_client.GetOrCreateRuntimeEnv - 4 total (0 active), CPU time: mean = 1.902 ms, total = 7.607 ms [state-dump] ObjectManager.ObjectDeleted - 3 total (0 active), CPU time: mean = 128.689 us, total = 386.067 us [state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch - 3 total (0 active), CPU time: mean = 31.851 us, total = 95.552 us [state-dump] Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 3 total (0 active), CPU time: mean = 273.225 us, total = 819.675 us [state-dump] NodeManagerService.grpc_server.GetSystemConfig - 2 total (0 active), CPU time: mean = 98.013 us, total = 196.027 us [state-dump] ObjectManager.HandlePull - 2 total (0 active), CPU time: mean = 698.587 us, total = 1.397 ms [state-dump] RuntimeEnvService.grpc_client.DeleteRuntimeEnvIfPossible - 2 total (0 active), CPU time: mean = 20.584 us, total = 41.168 us [state-dump] ClientConnection.async_write.DoAsyncWrites - 2 total (0 active), CPU time: mean = 1.170 us, total = 2.340 us [state-dump] NodeManagerService.grpc_server.RequestWorkerLease - 2 total (0 active), CPU time: mean = 763.734 us, total = 1.527 ms [state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 81.439 us, total = 162.879 us [state-dump] NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 1.836 ms, total = 1.836 ms [state-dump] JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 6.981 us, total = 6.981 us [state-dump] AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 276.626 us, total = 276.626 us [state-dump] WorkerInfoGcsService.grpc_client.ReportWorkerFailure - 1 total (0 active), CPU time: mean = 32.881 us, total = 32.881 us [state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 11.612 ms, total = 11.612 ms [state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 289.639 us, total = 289.639 us [state-dump] DebugString() time ms: 1 [state-dump] [state-dump] (raylet, ip=172.29.58.162) [2022-11-02 06:00:26,113 E 11598 11644] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause. 
Traceback (most recent call last): File "mlm_training.py", line 35, in wrapped_model.train() File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/distributed.py", line 204, in train train_state_manager.train_batch(epoch, batch_id) File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/train_state_manager.py", line 93, in train_batch self._communicate() File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/train_state_manager.py", line 129, in _communicate ray.get([worker.receive_gradients.remote() for worker in self.workers]) File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2303, in get raise value.as_instanceof_cause() ray.exceptions.RayTaskError(RuntimeError): ray::ReplicaWorker.receive_gradients() (pid=11712, ip=172.29.58.149, repr=) File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/worker.py", line 16, in wrapper result = f(*args, **kwds) File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/worker.py", line 214, in receive_gradients self.comm.receive_gradients() File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/communication/gloo.py", line 36, in receive_gradients col.allreduce( File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective.py", line 273, in allreduce g.allreduce([tensor], opts) File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective_group/gloo_collective_group.py", line 252, in allreduce self._collective(tensors, tensors, collective_fn) File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective_group/gloo_collective_group.py", line 466, in _collective collective_fn(input_tensors[0], output_tensors[0], self._gloo_context) File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective_group/gloo_collective_group.py", line 243, in collective_fn pygloo.allreduce( RuntimeError: [/root/.cache/bazel/_bazel_root/22470c32b6b05656d37eca60778f3711/sandbox/linux-sandbox/4/execroot/pygloo/external/gloo/gloo/transport/tcp/pair.cc:589] Read error [172.29.58.162]:35237: Connection reset by peer ```

logs_head.zip, logs_162.zip (node that failed), logs_133.zip (node that failed, but no actor was running on it)

rkooo567 commented 1 year ago

Hmm, it's interesting. So we are using psutil's Process.parent() (https://psutil.readthedocs.io/en/latest/) to see if the parent process is alive, and if this returns None, we consider the parent dead (meaning the parent pid is no longer known). The issue now is that although the parent (raylet) is alive, psutil's parent() returns None (so we mistakenly conclude the raylet is dead, and the agent kills itself, which kills the raylet because they fate share).

I assume there may be a subtle bug in psutil that returns an incorrect result. This is their parent() implementation: https://github.com/giampaolo/psutil/blob/08964b08700fcfc5a4917a01107009a3ee913341/psutil/__init__.py#L464
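If psutil really is returning a spurious None here, one defensive option (purely an illustration, not what Ray ships) would be to cross-check with the kernel's own view via os.getppid() before concluding the parent is gone:

```
# Illustrative cross-check: only treat the parent as dead if both psutil and
# the kernel's own view (os.getppid()) agree. Not Ray's actual logic.
import os

import psutil


def parent_looks_dead(expected_parent_pid: int) -> bool:
    parent = psutil.Process().parent()
    if parent is not None and parent.pid == expected_parent_pid:
        return False  # psutil still sees the raylet as our parent.
    # psutil says the parent is gone (or changed); double-check with getppid().
    # If we were truly orphaned, the kernel reparents us (typically to PID 1).
    return os.getppid() != expected_parent_pid
```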

While I talk about potential solutions with my colleague, do you mind trying one more time with a higher RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S (maybe a fairly long value like 30~60)? It looks like this env var is not set, based on your current error message. You should set it on every node (e.g., RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=60 ray start --address= & RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=60 ray start --head).

Also, as a temporary workaround, if you don't mind a process leak (agent.py), you can set a really, really long value like RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=10000000. This effectively disables the parent health checking. It may solve your issue temporarily, but when the raylet actually dies, the agent will not be able to detect that its parent is dead.

rkooo567 commented 1 year ago

Lmk how this went!

pratkpranav commented 1 year ago

I have set it running with RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=100000000 for now. It's been running for a day and a half now. I will keep it running for some more time.

I will try to run it as soon as I am done with the above one.

do you mind trying one more time with higher RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S? It looks like this env var is not set based on your current error message. (maybe very long value like 30~60)

rkooo567 commented 1 year ago

Sounds good to me! It would also be great if you test with a shorter time like 60 seconds (RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=60)! Please note that what you did disables the health check, so the agent may not be terminated properly even though the raylet is killed. In the medium term, we will see if it's possible to find a better health checking mechanism.

pratkpranav commented 1 year ago

@rkooo567, I ran the program again with RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=60. However, it failed as soon as the sleep ended. Adding the error output and all the log files from the worker that failed.

Error Output:

2022-11-07 07:07:50,455 WARNING worker.py:1849 -- Raylet is terminated: ip=172.29.58.112, id=3a909badf8a1e1b4db6cc6052adf53453851d0b188d4b59c3bfb4969. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        CoreWorkerService.grpc_client.PubsubCommandBatch - 4 total (0 active), CPU time: mean = 115.877 us, total = 463.509 us
    [state-dump]        ObjectManagerService.grpc_client.Pull - 2 total (0 active), CPU time: mean = 20.016 us, total = 40.032 us
    [state-dump]        RuntimeEnvService.grpc_client.GetOrCreateRuntimeEnv - 2 total (0 active), CPU time: mean = 8.369 ms, total = 16.738 ms
    [state-dump]        ObjectManager.ObjectAdded - 2 total (0 active), CPU time: mean = 1.101 ms, total = 2.203 ms
    [state-dump]        InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 85.853 us, total = 171.706 us
    [state-dump]        NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 564.471 us, total = 564.471 us
    [state-dump]        JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 5.382 us, total = 5.382 us
    [state-dump]        NodeManagerService.grpc_server.GetSystemConfig - 1 total (0 active), CPU time: mean = 97.424 us, total = 97.424 us
    [state-dump]        NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 12.787 ms, total = 12.787 ms
    [state-dump]        ClientConnection.async_write.DoAsyncWrites - 1 total (0 active), CPU time: mean = 546.000 ns, total = 546.000 ns
    [state-dump]        Subscriber.HandlePublishedMessage_GCS_WORKER_DELTA_CHANNEL - 1 total (0 active), CPU time: mean = 5.633 us, total = 5.633 us
    [state-dump]        Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 1 total (0 active), CPU time: mean = 717.672 us, total = 717.672 us
    [state-dump]        CoreWorkerService.grpc_client.UpdateObjectLocationBatch - 1 total (0 active), CPU time: mean = 10.926 us, total = 10.926 us
    [state-dump]        AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 228.876 us, total = 228.876 us
    [state-dump]        NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 274.577 us, total = 274.577 us
    [state-dump]        ObjectManager.ObjectDeleted - 1 total (0 active), CPU time: mean = 16.534 us, total = 16.534 us
    [state-dump]        ObjectManager.HandlePull - 1 total (0 active), CPU time: mean = 905.430 us, total = 905.430 us
    [state-dump] DebugString() time ms: 1
    [state-dump] 
    [state-dump] 

(raylet, ip=172.29.58.112) [2022-11-07 07:07:51,336 E 59435 59476] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
Traceback (most recent call last):
  File "mlm_training.py", line 35, in <module>
    wrapped_model.train()
  File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/distributed.py", line 204, in train
    train_state_manager.train_batch(epoch, batch_id)
  File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/train_state_manager.py", line 92, in train_batch
    self._compute_and_store_batch_gradients(batch_id)
  File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/train_state_manager.py", line 105, in _compute_and_store_batch_gradients
    ray.get(
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2305, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

logs_112.zip

rkooo567 commented 1 year ago

Got it. I am not sure what causes this, but psutil definitely seems broken in your env...

I think you can get around the issue with RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=100000000 for now. I will discuss the fix plan and get back to you...

pratkpranav commented 1 year ago

Hey @rkooo567, it seems we are again getting a similar issue, even with the RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=100000000. Were there any recent changes that might have caused this issue to resurface?

rkooo567 commented 1 year ago

Hmm, we didn't make any changes lately. Did you upgrade the version? This setting basically disables the feature, so it shouldn't actually happen anymore.

pratkpranav commented 1 year ago

Yes, I did upgrade the version. I tried 2.2.0, 2.3.0, and 2.4.0 and see similar issues. The last time we made it work with the daily wheels; I worked with that for a while, but as soon as I moved to any of those versions I see that error.

pratkpranav commented 1 year ago

@rkooo567, when I moved to 2.4.0, I got the following errors; the error log from dashboard_agent.log is included below.

Error log from terminal:


    while distributed_trainer.step():
  File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/distributed.py", line 506, in step
    has_next_batch = self.train_state_manager.train_batch(epoch=self.current_epoch)
  File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/train_state_manager.py", line 119, in train_batch
    all_workers_have_next_batch = self._compute_and_store_next_batch_gradients()
  File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/train_state_manager.py", line 144, in _compute_and_store_next_batch_gradients
    has_next_batches = ray.get(
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2523, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Error from dashboard_agent.log from the failed worker.

2023-05-24 08:08:53,268 ERROR reporter_agent.py:1112 -- Error publishing node physical stats.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1653, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1885, in memory_info
    with open_binary("%s/%s/statm" % (self._procfs_path, self.pid)) as f:
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_common.py", line 765, in open_binary
    return open(fname, "rb", buffering=FILE_READ_BUFFER_SIZE)
FileNotFoundError: [Errno 2] No such file or directory: '/proc/39122/statm'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 1095, in _perform_iteration
    stats = self._get_all_stats()
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 626, in _get_all_stats
    "workers": self._get_workers(),
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 512, in _get_workers
    return [
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 513, in <listcomp>
    w.as_dict(
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/__init__.py", line 528, in as_dict
    ret = meth()
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/__init__.py", line 1083, in memory_full_info
    return self._proc.memory_full_info()
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1951, in memory_full_info
    basic_mem = self.memory_info()
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1660, in wrapper
    raise NoSuchProcess(self.pid, self._name)
psutil.NoSuchProcess: process no longer exists (pid=39122)
Last 100 lines from dashboard_agent.log from failed worker
``` ==> /tmp/ray/session_latest/logs/dashboard_agent.log <== 2023-05-24 01:41:12,121 INFO agent.py:116 -- Parent pid is 39024 2023-05-24 01:41:12,122 INFO agent.py:142 -- Dashboard agent grpc address: 0.0.0.0:61123 2023-05-24 01:41:12,124 INFO utils.py:112 -- Get all modules by type: DashboardAgentModule 2023-05-24 01:41:12,455 INFO utils.py:145 -- Available modules: [, , , , , , , ] 2023-05-24 01:41:12,455 INFO agent.py:171 -- Loading DashboardAgentModule: 2023-05-24 01:41:12,455 INFO event_agent.py:38 -- Event agent cache buffer size: 10240 2023-05-24 01:41:12,455 INFO agent.py:171 -- Loading DashboardAgentModule: 2023-05-24 01:41:12,455 INFO agent.py:171 -- Loading DashboardAgentModule: 2023-05-24 01:41:12,455 INFO agent.py:171 -- Loading DashboardAgentModule: 2023-05-24 01:41:12,455 INFO agent.py:171 -- Loading DashboardAgentModule: 2023-05-24 01:41:12,455 INFO agent.py:171 -- Loading DashboardAgentModule: 2023-05-24 01:41:12,460 INFO agent.py:171 -- Loading DashboardAgentModule: 2023-05-24 01:41:12,461 INFO agent.py:171 -- Loading DashboardAgentModule: 2023-05-24 01:41:12,461 INFO agent.py:176 -- Loaded 8 modules. 2023-05-24 01:41:12,466 INFO http_server_agent.py:74 -- Dashboard agent http address: 0.0.0.0:52365 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> > 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> > 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> > 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> > 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> > 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> > 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> > 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> > 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> > 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> > 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> 2023-05-24 01:41:12,466 INFO http_server_agent.py:81 -- -> > 2023-05-24 01:41:12,467 INFO http_server_agent.py:81 -- -> 2023-05-24 01:41:12,467 INFO http_server_agent.py:81 -- -> > 2023-05-24 01:41:12,467 INFO http_server_agent.py:81 -- -> 2023-05-24 01:41:12,467 INFO http_server_agent.py:81 -- -> > 2023-05-24 01:41:12,467 INFO http_server_agent.py:81 -- -> 2023-05-24 01:41:12,467 INFO http_server_agent.py:81 -- -> > 2023-05-24 01:41:12,467 INFO http_server_agent.py:81 -- PosixPath('/tmp/ray/session_2023-05-24_00-40-46_570983_249787/logs')> -> PosixPath('/tmp/ray/session_2023-05-24_00-40-46_570983_249787/logs')>> 2023-05-24 01:41:12,467 INFO http_server_agent.py:81 -- PosixPath('/tmp/ray/session_2023-05-24_00-40-46_570983_249787/logs')> -> > 2023-05-24 01:41:12,467 INFO http_server_agent.py:82 -- Registered 30 routes. 
2023-05-24 01:41:12,473 INFO event_agent.py:56 -- Report events to 172.29.216.212:45015 2023-05-24 01:41:12,475 INFO event_utils.py:132 -- Monitor events logs modified after 1684905072.2749348 on /tmp/ray/session_2023-05-24_00-40-46_570983_249787/logs/events, the source types are all. 2023-05-24 08:08:53,268 ERROR reporter_agent.py:1112 -- Error publishing node physical stats. Traceback (most recent call last): File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1653, in wrapper return fun(self, *args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1885, in memory_info with open_binary("%s/%s/statm" % (self._procfs_path, self.pid)) as f: File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_common.py", line 765, in open_binary return open(fname, "rb", buffering=FILE_READ_BUFFER_SIZE) FileNotFoundError: [Errno 2] No such file or directory: '/proc/39122/statm' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 1095, in _perform_iteration stats = self._get_all_stats() File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 626, in _get_all_stats "workers": self._get_workers(), File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 512, in _get_workers return [ File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 513, in w.as_dict( File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/__init__.py", line 528, in as_dict ret = meth() File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/__init__.py", line 1083, in memory_full_info return self._proc.memory_full_info() File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1951, in memory_full_info basic_mem = self.memory_info() File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1660, in wrapper raise NoSuchProcess(self.pid, self._name) psutil.NoSuchProcess: process no longer exists (pid=39122) 2023-05-24 08:10:44,850 INFO metrics_agent.py:245 -- Metrics from a worker (7341f7e82bb0c8bf47f0f4d7250f38b1c3106002b6fc6457a06d3b09) is cleaned up due to timeout. Time since last report 121.55140731693245s ```
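For what it's worth, the reporter error above looks like a race where a worker process exits between being enumerated and having its stats read. As a general pattern (an illustrative sketch only, not the reporter_agent code), per-process psutil reads can be wrapped so that a vanished process is skipped instead of raising:

```
# General pattern for snapshotting per-process stats with psutil while
# tolerating processes that exit mid-iteration (the race shown in the
# traceback above). Illustrative only, not Ray's reporter_agent code.
import psutil


def snapshot_worker_stats(pids):
    stats = []
    for pid in pids:
        try:
            proc = psutil.Process(pid)
            stats.append(proc.as_dict(attrs=["pid", "cpu_percent", "memory_info"]))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            # The process exited (or became inaccessible) between listing and
            # reading its stats; just skip it.
            continue
    return stats
```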
pratkpranav commented 1 year ago

@rkooo567 I just wanted to check whether you had time to look into the error.

alejandroarmas commented 1 year ago

Any updates? I am experiencing the same issue

rkooo567 commented 1 year ago

Do you guys see the issue in the latest version? (ray 2.6)

garylavayou commented 9 months ago

Running a mini-cluster with Ray 2.8.1, this problem still occurs.

Here are the recent logs from the dashboard_agent:

2023-12-24 16:56:38,169 INFO process_watcher.py:44 -- raylet pid is 41
2023-12-24 16:56:38,175 INFO event_agent.py:56 -- Report events to 172.29.1.2:33455
2023-12-24 16:56:38,176 INFO event_utils.py:132 -- Monitor events logs modified after 1703463998.0062358 on /tmp/ray/session_2023-12-24_16-56-33_688310_1/logs/events, the source types are all.
2023-12-25 20:39:55,379 ERROR process_watcher.py:158 -- Failed to check parent PID, exiting.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/process_watcher.py", line 73, in _check_parent
    parent = curr_proc.parent()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/thirdparty_files/psutil/__init__.py", line 564, in parent
    ppid = self.ppid()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/thirdparty_files/psutil/_common.py", line 486, in wrapper
    raise raise_from(err, None)
  File "<string>", line 3, in raise_from
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/thirdparty_files/psutil/_common.py", line 484, in wrapper
    return fun(self)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/thirdparty_files/psutil/__init__.py", line 623, in ppid
    self._raise_if_pid_reused()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/thirdparty_files/psutil/__init__.py", line 436, in _raise_if_pid_reused
    raise NoSuchProcess(self.pid, self._name, msg=msg)
psutil.NoSuchProcess: process no longer exists and its PID has been reused (pid=83)

Here the process_watcher, one of the threads of the dashboard_agent process, complains that its parent (i.e., the raylet) no longer exists. We can see that the initial PID of the raylet is 41, but the error message shows PID 83, so psutil may have a bug in how it records the parent process.
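
For context, psutil's "PID has been reused" check is, as far as I understand, roughly a creation-time comparison: the `Process` object remembers when the process it was built from started, and if a process with the same PID now reports a different start time, psutil concludes the PID was recycled and raises `NoSuchProcess`. A minimal sketch of that idea (not psutil's actual code), using the raylet PID 41 from the log above:

```python
import psutil

# Snapshot the raylet's identity (PID + creation time) the way the agent
# would at startup. PID 41 is taken from the process_watcher log above.
raylet = psutil.Process(41)
raylet_created_at = raylet.create_time()

def raylet_pid_reused() -> bool:
    """Rough illustration of the reuse heuristic, not psutil's implementation."""
    if not psutil.pid_exists(41):
        return False  # the PID is simply gone, not reused
    # A different creation time means PID 41 now belongs to another process.
    return psutil.Process(41).create_time() != raylet_created_at
```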

At almost the same time, the raylet also detected that the dashboard_agent was gone, as shown in the following logs:

[2023-12-25 20:39:55,615 I 41 84] (raylet) agent_manager.cc:66: Agent process with name dashboard_agent/1059961393 exited, exit code 0.
[2023-12-25 20:39:55,615 E 41 84] (raylet) agent_manager.cc:70: The raylet exited immediately because one Ray agent failed, agent_name = dashboard_agent/1059961393.
[2023-12-25 20:39:55,615 I 41 84] (raylet) raylet_util.h:28: Sending SIGTERM to gracefully shutdown raylet
[2023-12-25 20:39:55,615 I 41 41] (raylet) main.cc:368: Raylet received SIGTERM, shutting down...
[2023-12-25 20:39:55,615 I 41 41] (raylet) accessor.cc:451: Unregistering node info, node id = 0cbd570687c28f137a1c5acb08026740602efeb0bc95ed5a9d2abbb3
[2023-12-25 20:39:55,616 I 41 41] (raylet) agent_manager.cc:95: Killing agent dashboard_agent/1059961393, pid 83.
[2023-12-25 20:39:55,616 I 41 41] (raylet) agent_manager.cc:95: Killing agent runtime_env_agent, pid 85.
[2023-12-25 20:39:55,620 I 41 86] (raylet) agent_manager.cc:66: Agent process with name runtime_env_agent exited, exit code 0.

The dashboard_agent.log says that the "PID has been reused", but my cluster is running in Docker with no workload, so no new process should have been created that reuses the PID of the dead raylet. Otherwise, the raylet might not have truly died; could the detection mechanism of process_watcher have some flaws?
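
One way to cross-check this from inside the agent process, independent of psutil's cached state, would be to compare the live parent PID with the raylet PID recorded at startup; on Linux an orphaned child is re-parented (typically to PID 1), so a mismatch would mean the raylet really exited. This is only a diagnostic sketch, not Ray's actual process_watcher logic:

```python
import os

RAYLET_PID = 41  # "raylet pid is 41" reported by process_watcher.py above

def raylet_parent_alive() -> bool:
    # If the raylet had really died, this agent would have been re-parented
    # (usually to PID 1 inside the container), so the live PPID would no
    # longer match the recorded raylet PID.
    return os.getppid() == RAYLET_PID
```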


Besides, I run the cluster with Docker Compose, using the following compose file:

services:
  head:  # <service name>
    # - official image: rayproject/ray:2.8.1-py310-cpu
    #   tags: 2.8.1, 2.8.1-py310-cpu, 2.8.1-py310-gpu
    # - custom image: ray:2.8.1-py310-cpu
    image: ray:2.8.1-py310-cpu
    container_name: ray-head
    hostname: ray-head
    networks:
      - net
    ports:
      - 8265:8265
      - 10001:10001
    entrypoint: ["ray", "start"]
    command: 
      - --head
      - --port=6379
      - --dashboard-host=0.0.0.0
      - --block
      - --disable-usage-stats

    environment:
      # Detecting docker specified CPUs. In previous versions of Ray, CPU detection 
      # in containers was incorrect. Please ensure that Ray has enough CPUs allocated. 
      # As a temporary workaround to revert to the prior behavior, 
      # set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. 
      # Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
      - RAY_DISABLE_DOCKER_CPU_WARNING=1
      # - RAY_USE_MULTIPROCESSING_CPU_COUNT=1
    shm_size: 2g
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 4096m
        reservations:
          cpus: "1"
          memory: 1024m
    # run_options:   # Extra options to pass into "docker run"
    #     - --ulimit nofile=65536:65536

  worker:  # <service name>
    image: ray:2.8.1-py310-cpu
    shm_size: 2g
    depends_on:
      - head
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: "2"
          memory: 2048m
        reservations:
          cpus: "1"
          memory: 1024m
    networks:
      - net
    entrypoint: ["ray", "start"]
    command: 
      - --address=ray-head:6379
      - --block
      - --disable-usage-stats

    environment:
      - RAY_DISABLE_DOCKER_CPU_WARNING=1

networks:
  net:
    ipam:
      driver: default
      config:
        - subnet: "172.29.1.0/24"

The customized image is built with the following Dockerfile:

FROM rayproject/ray:2.8.1-py310-cpu

USER root
ENV TZ='Asia/Shanghai'
RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && \
    echo "$TZ" > /etc/timezone
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install --yes ca-certificates 

# base image is based on ubuntu:focal
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && \
    apt-get install --no-install-recommends --yes \
    gcc libc-dev openjdk-11-jre
ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64

USER ray
RUN --mount=type=cache,target=/home/ray/.cache/pip,sharing=locked \
    pip install --index-url https://pypi.doubanio.com/simple/ \
        'raydp==1.6.0' \
        'dask>=2022.10.1' \
        'modin[ray]'
garylavayou commented 9 months ago

Running another instance with version 2.6.3, the bug is a little different from the one in 2.8.1. Here are the related logs:

UID        PID  PPID  C STIME TTY          TIME CMD
ray          1     0  0 Dec27 ?        00:00:20 /bin/bash /home/ray/startup.sh worker
ray         37     1  1 Dec27 ?        00:20:10 .../ray/raylet/raylet...
ray         38     1  0 Dec27 ?        00:12:00 ...log_monitor.py...
ray         78    37  1 Dec27 ?        00:19:19 ...dashboard/agent.py
ray        118     1  0 Dec27 ?        00:00:05 tail -f /tmp/ray/session_latest/logs/raylet.out
ray        328     0  0 Dec27 pts/0    00:00:00 /bin/sh
ray        336   328  0 Dec27 pts/0    00:00:00 bash
ray      33142     1  0 13:07 ?        00:00:00 sleep 5
ray      33143   336  0 13:07 pts/0    00:00:00 ps -ef
2023-12-28 13:08:13,049 ERROR reporter_agent.py:1112 -- Error publishing node physical stats.
Traceback (most recent call last): 
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 1095, in _perform_iteration
    stats = self._get_all_stats()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 626, in _get_all_stats
    "workers": self._get_workers(),
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 490, in _get_workers
    self._generate_worker_key(proc): proc for proc in raylet_proc.children()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/thirdparty_files/psutil/__init__.py", line 277, in wrapper
    raise NoSuchProcess(self.pid, self._name, msg=msg)
psutil.NoSuchProcess: process no longer exists and its PID has been reused (pid=37)

The raylet process is still there (PID=37), but the dashboard_agent reports that the process no longer exists and its PID has been reused (pid=37).

Another difference is that the two processes do not exit and keep producing logs like the above.
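
To check whether this is a false positive from a stale cached `Process` object rather than a real exit, a quick diagnostic (just a sketch, run inside the worker container) could re-query PID 37 from scratch and compare against the `ps` output above:

```python
import psutil

pid = 37  # raylet PID from the `ps -ef` output above
if psutil.pid_exists(pid):
    p = psutil.Process(pid)  # a fresh Process object, no cached identity
    # If this prints the raylet command line and a start time that matches
    # the `ps` output, the process is genuinely alive and the NoSuchProcess
    # above comes from psutil's PID-reuse heuristic, not a real exit.
    print(p.cmdline(), p.create_time(), p.status())
else:
    print("PID 37 is gone; the raylet really exited.")
```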


The compose file is slightly modified to read logs via the startup script:

services:
  head: # <service name>
    image: ray:${RAY_VERSION:-2.6.3-py310-cpu}
    container_name: ray-head
    hostname: ray-head
    networks:
      - net
    ports:
      - 8265:8265
      - 10001:10001
    command: ["head"]

    environment:
      - RAY_DISABLE_DOCKER_CPU_WARNING=1
    shm_size: 2g
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 4096m
        reservations:
          cpus: "1"
          memory: 1024m

  worker: # <service name>
    image: ray:${RAY_VERSION:-2.6.3-py310-cpu}
    shm_size: 2g
    depends_on:
      - head
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: "2"
          memory: 2048m
        reservations:
          cpus: "1"
          memory: 1024m
    networks:
      - net
    command: ["worker"]
    environment:
      - RAY_DISABLE_DOCKER_CPU_WARNING=1

networks:
  net:
    ipam:
      driver: default
      config:
        - subnet: "172.29.1.0/24"

and the Dockerfile:

ARG RAY_VERSION=2.6.3-py310-cpu
FROM rayproject/ray:${RAY_VERSION}

USER root
ENV TZ='Asia/Shanghai'
RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && \
    echo "$TZ" > /etc/timezone
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install --yes ca-certificates 

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && \
    apt-get install --no-install-recommends --yes \
    gcc libc-dev openjdk-11-jre
ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64

ADD ./startup.sh /home/ray/
ENTRYPOINT ["/home/ray/startup.sh"]

USER ray
RUN --mount=type=cache,target=/home/ray/.cache/pip,sharing=locked,uid=1000,gid=100 \
    pip install --index-url https://pypi.doubanio.com/simple/ \
        'raydp==1.6.0' \
        'dask>=2022.10.1' \
        'modin[ray]'

with the customized startup script startup.sh:

#!/bin/bash

service=${1:?'Please specify the service to start!'}
shift
monitor_file='raylet.out'

case $service in
head)
    ray start \
        --head \
        --port=6379 \
        --dashboard-host=0.0.0.0 \
        --disable-usage-stats \
        "${@}"
    ;;
worker)
    ray start \
        --address=ray-head:6379 \
        --disable-usage-stats \
        "${@}"
    ;;
*)
    echo "error: unknown service <${service}>to start!"
    exit 1
    ;;
esac
tail -f "/tmp/ray/session_latest/logs/${monitor_file}" < /dev/null &

while true; do
    if ps -C raylet -f > /dev/null; then
        sleep 5
    else
        echo "warning: raylet process has exited, the monitor process now exit."
        exit 0
    fi
done