real-stanford / flingbot

[CoRL 2021 Best System Paper] This repository contains code for training and evaluating FlingBot, both in simulation and in the real world on a dual-UR5 robot arm setup, targeting Ubuntu 18.04.
https://flingbot.cs.columbia.edu/

RayActorError: The actor died unexpectedly before finishing this task. #7

Closed · licheng198 closed this issue 1 year ago

licheng198 commented 1 year ago

ID: fffffffffffffffffd5f641d1e7ed592e00f065c01000000
Worker ID: 20556ec92d7abbb0b2bf7fb0d363e865ab7587196b608cb3605c3f2f
Node ID: 75d7f57c65e988cab741a0c5412548fb08f8fcb7e118eb82fc38fc4d
Worker IP address: 172.17.0.2, Worker port: 37185, Worker PID: 1224
Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Traceback (most recent call last):
  File "run_sim.py", line 46, in <module>
    envs, task_loader = setup_envs(dataset=dataset_path, **vars(args))
  File "/workspace/flingbot/utils.py", line 158, in setup_envs
    ray.get([e.setup_ray.remote(e) for e in envs])
  File "/home/li/anaconda3/envs/flingbot/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/li/anaconda3/envs/flingbot/lib/python3.6/site-packages/ray/_private/worker.py", line 2277, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
  class_name: SimEnv
  actor_id: fd5f641d1e7ed592e00f065c01000000
  pid: 1224
  namespace: 79413bf6-9d63-48f7-baa3-20f27c337fe9
  ip: 172.17.0.2
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR. Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors. The actor never ran - it was cancelled before it started running.

ElcarimQAQ commented 1 year ago

Have you solved the problem yet? I had the same problem.

licheng198 commented 1 year ago

I haven't solved the problem yet.

Spphire commented 1 year ago

I haven't solved the problem yet.

Have you solved the problem yet?

licheng198 commented 1 year ago

I'm sorry I haven't solved the problem yet.


zcswdt commented 1 year ago

> Have you solved the problem yet? I had the same problem.

Have you solved the problem yet? I had the same problem. Thank you!

zcswdt commented 1 year ago

> I haven't solved the problem yet.
>
> Have you solved the problem yet?

Have you solved the problem yet? I had the same problem.

huy-ha commented 12 months ago

I recommend tracking RAM usage, since it looks like Ray actors are being killed due to OOM.
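For example, here is a small monitor you could run in a separate terminal while run_sim.py starts its Ray actors. This is only a sketch and not part of the FlingBot codebase; it assumes psutil is installed (pip install psutil):

```python
import time

import psutil

# Poll system memory every 2 seconds. If available RAM drops toward zero
# right before the actors die, the OOM killer is the likely culprit.
while True:
    mem = psutil.virtual_memory()
    print(f"used {mem.used / 1e9:.1f} GB of {mem.total / 1e9:.1f} GB "
          f"({mem.percent:.0f}%)")
    time.sleep(2)
```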

If RAM does turn out to be the issue, you can reduce the number of parallel environments by setting num_processes accordingly.

If the OOM killer is not what has been killing your Ray actors, run a single environment (num_processes=1) and pass local_mode=True when you initialize Ray. This should give much more informative error messages, and you can debug from there. For instance, the pyflex.init call can fail for many reasons, for which I would refer to the pyflex issues page.
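As a minimal sketch of that debugging setup (the exact place where this repo calls ray.init may differ, so treat this as illustrative rather than a patch):

```python
import ray

# local_mode=True runs all Ray tasks and actors serially inside the driver
# process, so a crash during SimEnv setup (e.g. inside pyflex.init) raises a
# normal Python traceback instead of an opaque RayActorError.
ray.init(local_mode=True)
```

Combined with a single environment (num_processes=1, assuming that value is exposed as an argument to run_sim.py), the failing call should be much easier to pinpoint.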

Hope this helps!

zcswdt commented 9 months ago

Hello, may I ask how many training steps (and how long) it takes to complete training?

huy-ha commented 9 months ago

Hey there! From the paper:

> All policies are trained in simulation until convergence, which takes around 150,000 simulation steps, or 6 days on a machine with a GTX 1080 Ti.