marmotlab / PRIMAL2

Training code PRIMAL2 - Public Repo
MIT License

ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. #15

Open ekkooee7 opened 5 months ago

ekkooee7 commented 5 months ago

Hi, I ran into this problem when running python driver.py.

Hello World... From global
(pid=36500)
(imitationRunner pid=37879) Hello World... From global
(imitationRunner pid=37879) starting episode 0 on metaAgent 0
(imitationRunner pid=37879) running imitation job
2024-04-02 16:24:19,702 WARNING worker.py:1986 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
RayTask ID: ffffffffffffffff57261355039a445aab5c889701000000
Worker ID: 5319944c466cd717513b05721f5bb35ee9d0bc67636ca45d75ec4b26
Node ID: 9142cb0d3cde6bac61a2c9ea58188ae8f649a46cd4f8ab495df8f181
Worker IP address: 10.26.224.144
Worker port: 46573
Worker PID: 37880
Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(imitationRunner pid=37880) cannot allocate memory for thread-local data: ABORT
Traceback (most recent call last):
  File "/home/waz/workspace/PRIMAL2/driver.py", line 338, in <module>
    main()
  File "/home/waz/workspace/PRIMAL2/driver.py", line 170, in main
    jobResults, metrics, info = ray.get(done_id)[0]
  File "/home/waz/anaconda3/envs/py36/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/waz/anaconda3/envs/py36/lib/python3.6/site-packages/ray/_private/worker.py", line 2523, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: imitationRunner
    actor_id: 57261355039a445aab5c889701000000
    pid: 37880
    namespace: 36ab3fad-5802-47dd-a1b7-63dece3b6d68
    ip: 10.26.224.144
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2024-04-02 16:24:19,788 WARNING worker.py:1986 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
RayTask ID: ffffffffffffffff747a4f29667195d12b49c67b01000000
Worker ID: 7b73a6ec53dcefe2ebdf2886269b2f5c58b0a07f4dba5383bc0bdb60
Node ID: 9142cb0d3cde6bac61a2c9ea58188ae8f649a46cd4f8ab495df8f181
Worker IP address: 10.26.224.144
Worker port: 34091
Worker PID: 37879
Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(imitationRunner pid=37879) cannot allocate memory for thread-local data: ABORT

I changed the number of agents and threads, and I made sure my compute resources are sufficient (a server with two Xeon Silver CPUs and 2x 4090 + 6x 4080 GPUs). Do you have any idea what causes this problem?
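The abort message "cannot allocate memory for thread-local data" appears to come from the C library failing to allocate memory while setting up a thread, so per-process limits may matter as much as the installed hardware. The following is a minimal diagnostic sketch, assuming a Linux host; it is not part of PRIMAL2, and the function names (read_meminfo, describe_limit, report) are hypothetical. It only prints what the current shell would allow a worker process to use.

```python
import os
import resource


def read_meminfo(path="/proc/meminfo"):
    """Return /proc/meminfo as a dict of kB values; empty dict if unreadable."""
    info = {}
    try:
        with open(path) as f:
            for line in f:
                key, _, rest = line.partition(":")
                fields = rest.split()
                if fields and fields[0].isdigit():
                    info[key.strip()] = int(fields[0])
    except OSError:
        pass  # not Linux, or the file is not accessible
    return info


def _fmt(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def describe_limit(name):
    """Format the soft and hard value of one limit, e.g. 'RLIMIT_NPROC'."""
    soft, hard = resource.getrlimit(getattr(resource, name))
    return f"{name}: soft={_fmt(soft)} hard={_fmt(hard)}"


def report():
    """Collect CPU count, memory headroom, and relevant per-process limits."""
    lines = [f"CPU cores visible: {os.cpu_count()}"]
    mem = read_meminfo()
    for key in ("MemTotal", "MemAvailable", "SwapFree"):
        if key in mem:
            # /proc/meminfo reports kB, so divide by 1024**2 for GiB
            lines.append(f"{key}: {mem[key] / 1024**2:.1f} GiB")
    # Max processes/threads per user, address-space cap, default stack size
    for name in ("RLIMIT_NPROC", "RLIMIT_AS", "RLIMIT_STACK"):
        if hasattr(resource, name):
            lines.append(describe_limit(name))
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Running this in the same shell (and the same conda environment) that launches driver.py shows the limits the Ray workers would inherit. A low RLIMIT_NPROC or a finite RLIMIT_AS would be worth ruling out before looking elsewhere, though neither is confirmed as the cause in this thread.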

shanyaolingling commented 3 months ago

Hello, have you solved this problem? I am running into the same issue.

2024-06-07 18:34:27,860 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
RayTask ID: ffffffffffffffffacd284e1f8c50cabcb93ada401000000
Worker ID: 1cb486b8f2dfadfd60473610d32c9df8cd0db6defbe70ea46c362cd5
Node ID: f469e2532f4d5ffb080939624763dd3e02888f81bf30331155b64d0a
Worker IP address: 10.4.52.11
Worker port: 39717
Worker PID: 9106
Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Traceback (most recent call last):
  File "/home/dingyanling/downloads/PRIMAL2-main/driver.py", line 235, in <module>
    main()
  File "/home/dingyanling/downloads/PRIMAL2-main/driver.py", line 176, in main
    jobResults, metrics, info = ray.get(done_id)[0]
  File "/home/dingyanling/downloads/anaconda3/envs/mapf/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/dingyanling/downloads/anaconda3/envs/mapf/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/dingyanling/downloads/anaconda3/envs/mapf/lib/python3.9/site-packages/ray/_private/worker.py", line 2565, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: imitationRunner
    actor_id: acd284e1f8c50cabcb93ada401000000
    pid: 9106
    namespace: 3682ea21-ae56-46e2-8c7a-f56835c9a85c
    ip: 10.4.52.11
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
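Since both reports carry the same labeled fields, it can help to pull them out of the warning text when comparing runs. The sketch below assumes only the field labels visible in the warnings quoted in this thread; parse_worker_death is a hypothetical helper, not a Ray API, and it reads just the first warning in the text it is given.

```python
import re

# Field labels exactly as they appear in the worker-death warning above.
FIELDS = (
    "Worker ID",
    "Node ID",
    "Worker IP address",
    "Worker port",
    "Worker PID",
    "Worker exit type",
)


def parse_worker_death(text):
    """Return a dict of the labeled fields found in one warning message."""
    result = {}
    for label in FIELDS:
        # Each value is a single whitespace-free token after "<label>:".
        match = re.search(re.escape(label) + r":\s*(\S+)", text)
        if match:
            result[label] = match.group(1)
    return result


# Sample copied from the second report in this thread.
sample = (
    "Worker ID: 1cb486b8f2dfadfd60473610d32c9df8cd0db6defbe70ea46c362cd5 "
    "Node ID: f469e2532f4d5ffb080939624763dd3e02888f81bf30331155b64d0a "
    "Worker IP address: 10.4.52.11 Worker port: 39717 Worker PID: 9106 "
    "Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly "
    "exits with a connection error code 2. End of file."
)

info = parse_worker_death(sample)
print(info["Worker PID"], info["Worker exit type"])  # prints: 9106 SYSTEM_ERROR
```

The extracted Worker PID is the value to look for in the per-worker log files that the warning suggests checking.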

zhx0506 commented 2 months ago

Hello, I have run into the same problem. Have you solved it?

ekkooee7 commented 2 months ago

Not yet, I don't know how to fix it 😢

zhx0506 commented 2 months ago

Hey, could we add each other on WeChat to discuss and learn together? My WeChat ID: zhxly1018