real-stanford / flingbot

[CoRL 2021 Best System Paper] This repository contains code for training and evaluating FlingBot in both simulation and real-world settings on a dual-UR5 robot arm setup for Ubuntu 18.04
https://flingbot.cs.columbia.edu/

When I run `python run_sim.py`, the worker died or was killed by an unexpected system error #2

Open robint-XNF opened 2 years ago

robint-XNF commented 2 years ago

When I run

    python run_sim.py --eval --tasks flingbot-normal-rect-eval.hdf5 --load flingbot.pth --num_processes 1 --gui

I get this error:

    ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    2021-11-22 15:10:23,194 WARNING worker.py:1228 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
    RayTask ID: ffffffffffffffff341cd030556402df7c59625701000000
    Worker ID: 4f72e151e496fac468e1c730556e291e00ec1cfb29882f51097186fd
    Node ID: d4a9eb590967aeb63fe838e2eca52cf666565bf009207c0ec4a730e6
    Worker IP address: 192.168.1.106
    Worker port: 41747
    Worker PID: 18687

I don't know why this happens. Could you please help me?
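The warning above says to check the logs for the dead worker. A minimal sketch for locating them, assuming Ray writes to its default log directory `/tmp/ray/session_latest/logs` and that the worker's log file names contain its PID (both are assumptions about the local setup; `find_worker_logs` is a hypothetical helper, not part of Ray or FlingBot):

```python
from pathlib import Path


def find_worker_logs(pid: int, log_dir: str = "/tmp/ray/session_latest/logs") -> list:
    """Return log files whose names contain the given worker PID.

    The default log_dir is an assumption: it is where Ray normally keeps the
    logs of the most recent session unless a custom temp dir was configured.
    """
    root = Path(log_dir)
    if not root.is_dir():
        # No Ray session directory here, so there is nothing to list.
        return []
    return sorted(p for p in root.iterdir() if p.is_file() and str(pid) in p.name)


# 18687 is the "Worker PID" reported in the warning above.
for path in find_worker_logs(18687):
    print(path)
```

The `.err` files among the results are usually the first place to look for the traceback that killed the actor.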

robint-XNF commented 2 years ago

Also, there is no 'replay_buffer.hdf5' in 'flingbot_eval_X'.

Jeffery-Zhou commented 2 years ago

I ran into the same issue. I thought it was a Ray version problem, but after testing it turned out to be something else. Have you solved it yet?

gtegner commented 2 years ago

Hey, I got the same error. Looking through the Ray logs, it's because Ray can't find the GPU. The relevant line is in utils.setup_envs:

    envs = [ray.remote(SimEnv).options(
        num_gpus=torch.cuda.device_count()/num_processes,
        num_cpus=0.1).remote(
        replay_buffer_path=dataset,
        get_task_fn=lambda: ray.get(task_loader.get_next_task.remote()),
        **kwargs)
        for _ in range(num_processes)]

The problem is that a CPU-only build of torch is installed, so torch.cuda.device_count() == 0 and consequently num_gpus=0. Hardcoding this to 1 (or however many GPUs you're using) fixes the problem!
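As an alternative to hardcoding, the division can be wrapped so that a zero device count fails early with a readable message instead of a dead Ray actor. This is only a sketch: `gpus_per_worker` is a hypothetical helper that does not exist in the repository, and raising an error here is a design choice of this example, not the authors' behaviour.

```python
def gpus_per_worker(device_count: int, num_processes: int) -> float:
    """Fraction of a GPU to request for each Ray actor.

    Mirrors the expression torch.cuda.device_count() / num_processes from
    utils.setup_envs, but refuses to return 0.
    """
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    if device_count == 0:
        # A CPU-only torch build reports zero devices even if a GPU is present.
        raise RuntimeError(
            "torch sees no CUDA devices; check torch.cuda.is_available() "
            "and the installed torch build"
        )
    return device_count / num_processes


# Intended use inside setup_envs (not executed here, needs torch and ray):
#   num_gpus=gpus_per_worker(torch.cuda.device_count(), num_processes)
print(gpus_per_worker(1, 2))
```

With one GPU and two processes each actor requests half a GPU, which is the same value the original expression produces.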

Barbany commented 2 years ago

Instead of hardcoding the number of GPUs:

I found out that the PyTorch installation and the CUDA drivers installed by the flingbot.yml file are not set up properly. Notice that torch.cuda.is_available() returns False. You can solve this by reinstalling PyTorch with pip, which already bundles compatible CUDA libraries. Then verify that torch.cuda.is_available() returns True and that the device count is correct.
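The verification step can be sketched as a small script. `cuda_report` is a hypothetical helper name; the torch calls it uses (`torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.device_count()`) are standard PyTorch attributes, and the script falls back gracefully if torch is not importable at all.

```python
import importlib.util


def cuda_report() -> dict:
    """Summarise whether the installed torch build can see a GPU."""
    if importlib.util.find_spec("torch") is None:
        # torch is not installed in this environment.
        return {
            "torch_installed": False,
            "built_with_cuda": None,
            "cuda_available": False,
            "device_count": 0,
        }
    import torch

    return {
        "torch_installed": True,
        # torch.version.cuda is None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

If `built_with_cuda` prints `None`, the environment has a CPU-only build and reinstalling PyTorch is the likely fix; if it prints a version but `cuda_available` is still False, the problem is more likely on the driver side.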

scarlett-sun commented 1 year ago

My problem is that when I run the evaluation command, the terminal stops at "Evaluating flingbot.pth: saving to flingbot_eval_X/replay_buffer.hdf5" once the animation finishes. Nothing further happens, and replay_buffer.hdf5 never appears in the directory.

zcswdt commented 1 year ago

Have you successfully run the code in this repository?