huggingface / lerobot

🤗 LeRobot: Making AI for Robotics more accessible with end-to-end learning
Apache License 2.0

Winit-related errors always happen when trying to run the examples #214

Closed All-embracing closed 4 months ago

All-embracing commented 4 months ago

EDITED (for readability)

System Info

Dear authors,

I appreciate your excellent work and would like to ask for your help. When I use the Dockerfile in the lerobot repo to build a new image (either the CPU or the GPU version) and then try to run an example such as

python lerobot/scripts/visualize_dataset.py --repo-id lerobot/pusht --episode-index 0

I always get winit-related errors like:

WARNING:datasets.packaged_modules.cache.cache:Found the latest cached dataset configuration 'default' at /root/.cache/huggingface/datasets/lerobot___pusht/default/0.0.0/31a41e121d12821207155735793fb9175d29a5dd (last modified on Thu May 23 08:18:41 2024).
[2024-05-27T03:19:52Z INFO  re_sdk_comms::server] Hosting a SDK server over TCP at 0.0.0.0:9876. Connect with the Rerun logging SDK.
Error: winit EventLoopError: os error at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/winit-0.29.9/src/platform_impl/linux/wayland/event_loop/mod.rs:81: Could not find wayland compositor -> os error at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/winit-0.29.9/src/platform_impl/linux/wayland/event_loop/mod.rs:81: Could not find wayland compositor
[2024-05-27T03:19:53Z WARN  re_sdk_comms::buffered_client] Failed to send message after 3 attempts: Failed to connect to Rerun server at 127.0.0.1:9876: Connection refused (os error 111)
[2024-05-27T03:19:55Z WARN  re_sdk_comms::buffered_client] Dropping messages because tcp client has timed out.
[2024-05-27T03:19:55Z WARN  re_sdk_comms::buffered_client] Dropping messages because tcp client has timed out.
[2024-05-27T03:19:55Z WARN  re_sdk_comms::tcp_client] Tried to flush while TCP stream was still Pending. Data was possibly dropped.

However, installing winit v0.30.0 or updating cargo did not solve these environment problems. Can you give me some advice?

Thanks very much.

Information

Reproduction

  1. Use the CPU or GPU Dockerfile to build a lerobot Docker image, e.g.
    sudo docker build -t lebot_docker_cpu:v1.0 -f ./docker/lerobot-cpu/Dockerfile . > docker_image.log 2>&1 &

    2. After building lebot_docker_cpu:v1.0 successfully, start a container from it:

    docker run -it lebot_docker_cpu:v1.0 /bin/bash

    3. Inside the container you just created, run the example:

    python lerobot/scripts/visualize_dataset.py --repo-id lerobot/pusht --episode-index 0

    4. Running (or re-running) the example always reports errors like:

    WARNING:datasets.packaged_modules.cache.cache:Found the latest cached dataset configuration 'default' at /root/.cache/huggingface/datasets/lerobot___pusht/default/0.0.0/31a41e121d12821207155735793fb9175d29a5dd (last modified on Thu May 23 08:18:41 2024).
    [2024-05-27T03:19:52Z INFO  re_sdk_comms::server] Hosting a SDK server over TCP at 0.0.0.0:9876. Connect with the Rerun logging SDK.
    Error: winit EventLoopError: os error at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/winit-0.29.9/src/platform_impl/linux/wayland/event_loop/mod.rs:81: Could not find wayland compositor -> os error at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/winit-0.29.9/src/platform_impl/linux/wayland/event_loop/mod.rs:81: Could not find wayland compositor
    [2024-05-27T03:19:53Z WARN  re_sdk_comms::buffered_client] Failed to send message after 3 attempts: Failed to connect to Rerun server at 127.0.0.1:9876: Connection refused (os error 111)
    [2024-05-27T03:19:55Z WARN  re_sdk_comms::buffered_client] Dropping messages because tcp client has timed out.
    [2024-05-27T03:19:55Z WARN  re_sdk_comms::buffered_client] Dropping messages because tcp client has timed out.
    [2024-05-27T03:19:55Z WARN  re_sdk_comms::tcp_client] Tried to flush while TCP stream was still Pending. Data was possibly dropped.

Expected behavior

I would really appreciate some details on how to use the Dockerfiles in the repo and run the examples successfully. Thanks a lot, dear authors.

aliberts commented 4 months ago

Hi, it looks like you need to open the correct ports on your container to be able to connect to the rerun server. Are you able to run the other examples/scripts btw?

All-embracing commented 4 months ago

Thanks Aliberts for your quick reply. Btw, how exactly do I open the correct ports on my container? I also ran into other errors when I tried the other examples/scripts, as shown below.

Case 1, running eval.py:

(venv) root@df9ea94d4779:/lerobot# python lerobot/scripts/eval.py -p lerobot/diffusion_pusht eval.n_episodes=10 eval.batch_size=10
config.yaml: 2.38kB [00:00, 6.92MB/s]
README.md: 3.03kB [00:00, 8.38MB/s]
eval_avg_max_reward.csv: 537B [00:00, 1.88MB/s]
.gitattributes: 1.52kB [00:00, 641kB/s]
eval_pc_success.csv: 100%|████████████████████████| 236/236 [00:00<00:00, 357kB/s]
config.json: 1.01kB [00:00, 3.19MB/s]
eval_info.json: 83.9kB [00:00, 944kB/s] | 0.00/236 [00:00<?, ?B/s]
demo.gif: 100%|████████████████████████████████| 601k/601k [00:00<00:00, 1.63MB/s]
train_loss.csv: 20.7kB [00:00, 23.9MB/s] | 5/11 [00:01<00:01, 5.62it/s]
training_curves.png: 100%|████████████████████| 28.6k/28.6k [00:00<00:00, 405kB/s]
model.safetensors: 100%|█████████████████████| 1.05G/1.05G [03:29<00:00, 5.01MB/s]
Fetching 11 files: 100%|██████████████████████████| 11/11 [03:31<00:00, 19.25s/it]
Traceback (most recent call last):
  File "/lerobot/lerobot/scripts/eval.py", line 621, in <module>
    eval(pretrained_policy_path=pretrained_policy_path, config_overrides=args.overrides)
  File "/lerobot/lerobot/scripts/eval.py", line 524, in eval
    device = get_safe_torch_device(hydra_cfg.device, log=True)
  File "/opt/venv/lib/python3.10/site-packages/lerobot/common/utils/utils.py", line 34, in get_safe_torch_device
    assert torch.cuda.is_available()
AssertionError

Case 2, running train.py:

(venv) root@df9ea94d4779:/lerobot# python lerobot/scripts/train.py policy=act env=aloha env.task=AlohaInsertion-v0 dataset_repo_id=lerobot/aloha_sim_insertion_human
Error executing job with overrides: ['policy=act', 'env=aloha', 'env.task=AlohaInsertion-v0', 'dataset_repo_id=lerobot/aloha_sim_insertion_human']
Traceback (most recent call last):
  File "/lerobot/lerobot/scripts/train.py", line 145, in train_cli
    train(
  File "/lerobot/lerobot/scripts/train.py", line 244, in train
    device = get_safe_torch_device(cfg.device, log=True)
  File "/opt/venv/lib/python3.10/site-packages/lerobot/common/utils/utils.py", line 34, in get_safe_torch_device
    assert torch.cuda.is_available()
AssertionError

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

aliberts commented 4 months ago

assert torch.cuda.is_available()

These are expected since you are running the cpu image. You need to pass the device=cpu option when running these scripts on cpu.
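
For example, a minimal sketch with the device=cpu override appended to the two commands you posted (assuming eval.py accepts the same Hydra-style override as train.py):

```bash
# Evaluate the pretrained diffusion policy on CPU
python lerobot/scripts/eval.py -p lerobot/diffusion_pusht \
    eval.n_episodes=10 eval.batch_size=10 device=cpu

# Train the ACT policy on the Aloha insertion task on CPU
python lerobot/scripts/train.py policy=act env=aloha env.task=AlohaInsertion-v0 \
    dataset_repo_id=lerobot/aloha_sim_insertion_human device=cpu
```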

As for the docker ports, you need to publish them with the -p option within your docker run command. You can learn more about it here.
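
For instance, a sketch using your image tag and the port the Rerun SDK server reports in your logs (9876):

```bash
# Publish the container's port 9876 on the host's port 9876
docker run -p 9876:9876 -it lebot_docker_cpu:v1.0 /bin/bash
```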

Btw, you can use these characters to format your shell/code output & scripts, it really helps with readability 😉

```bash
your terminal output
```

All-embracing commented 4 months ago

Thanks Aliberts for your helpful advice. I've tried both cases again. For the first case, running the train.py script with 'device=cpu', it starts but then fails with bus errors as shown below (even though I think the shared memory is large: kernel.shmall = 18446744073692774399, kernel.shmmax = 18446744073692774399):

(venv) root@ec8ea176d2d7:/lerobot# python lerobot/scripts/train.py policy=act env=aloha env.task=AlohaInsertion-v0 dataset_repo_id=lerobot/aloha_sim_insertion_human device=cpu
WARNING 2024-05-29 04:07:23 ils/utils.py:42 Using CPU, this will be slow.
INFO 2024-05-29 04:07:23 ts/train.py:250 make_dataset
Using the latest cached version of the dataset since lerobot/aloha_sim_insertion_human couldn't be found on the Hugging Face Hub
WARNING 2024-05-29 04:09:03 ts/load.py:1631 Using the latest cached version of the dataset since lerobot/aloha_sim_insertion_human couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /root/.cache/huggingface/datasets/lerobot___aloha_sim_insertion_human/default/0.0.0/3cdeec058acd3e3583e9c22c24e6c3338b8cd712 (last modified on Wed May 29 03:37:57 2024).
WARNING 2024-05-29 04:09:03 che/cache.py:95 Found the latest cached dataset configuration 'default' at /root/.cache/huggingface/datasets/lerobot___aloha_sim_insertion_human/default/0.0.0/3cdeec058acd3e3583e9c22c24e6c3338b8cd712 (last modified on Wed May 29 03:37:57 2024).
INFO 2024-05-29 04:11:45 ts/train.py:253 make_env
INFO 2024-05-29 04:11:45 /__init__.py:84 MUJOCO_GL=%s, attempting to import specified OpenGL backend.
INFO 2024-05-29 04:11:45 pesloader.py:70 Failed to load library ( %r ): %s
INFO 2024-05-29 04:11:45 /__init__.py:31 MuJoCo library version is: %s
INFO 2024-05-29 04:11:49 ts/train.py:256 make_policy
INFO 2024-05-29 04:11:49 on/logger.py:66 Logs will be saved locally.
INFO 2024-05-29 04:11:49 on/logger.py:31 Output dir: outputs/train/2024-05-29/04-07-23_aloha_act_default
INFO 2024-05-29 04:11:49 ts/train.py:271 cfg.env.task='AlohaInsertion-v0'
INFO 2024-05-29 04:11:49 ts/train.py:272 cfg.training.offline_steps=80000 (80K)
INFO 2024-05-29 04:11:49 ts/train.py:273 cfg.training.online_steps=0
INFO 2024-05-29 04:11:49 ts/train.py:274 offline_dataset.num_samples=17000 (17K)
INFO 2024-05-29 04:11:49 ts/train.py:275 offline_dataset.num_episodes=34
INFO 2024-05-29 04:11:49 ts/train.py:276 num_learnable_params=51613582 (52M)
INFO 2024-05-29 04:11:49 ts/train.py:277 num_total_params=51613672 (52M)
INFO 2024-05-29 04:11:49 ts/train.py:324 Start offline training on a fixed dataset
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Error executing job with overrides: ['policy=act', 'env=aloha', 'env.task=AlohaInsertion-v0', 'dataset_repo_id=lerobot/aloha_sim_insertion_human', 'device=cpu']
Traceback (most recent call last):
  File "/lerobot/lerobot/scripts/train.py", line 145, in train_cli
    train(
  File "/lerobot/lerobot/scripts/train.py", line 330, in train
    train_info = update_policy(
  File "/lerobot/lerobot/scripts/train.py", line 103, in update_policy
    output_dict = policy.forward(batch)
  File "/opt/venv/lib/python3.10/site-packages/lerobot/common/policies/act/modeling_act.py", line 140, in forward
    actions_hat, (mu_hat, log_sigma_x2_hat) = self.model(batch)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/lerobot/common/policies/act/modeling_act.py", line 330, in forward
    cam_features = self.backbone(images[:, cam_index])["feature_map"]
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torchvision/models/_utils.py", line 69, in forward
    x = module(x)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/pooling.py", line 164, in forward
    return F.max_pool2d(input, self.kernel_size, self.stride,
  File "/opt/venv/lib/python3.10/site-packages/torch/_jit_internal.py", line 497, in fn
    return if_false(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 796, in _max_pool2d
    return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
  File "/opt/venv/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1280) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

For the second case, the visualize_dataset.py script still hits winit errors even with the container's port 9876 published, as shown below:

nio@nio-HP-Z4-G4-Workstation:~$ docker run -p 9876:9876 -it lebot_docker_cpu:v1.0 /bin/bash
(venv) root@ec8ea176d2d7:/lerobot# export DISPLAY=:0
(venv) root@ec8ea176d2d7:/lerobot# python lerobot/scripts/visualize_dataset.py     --repo-id lerobot/pusht     --episode-index 0
Fetching 222 files: 100%|█████████████████████| 222/222 [00:00<00:00, 2820.27it/s]
[2024-05-29T03:27:25Z INFO  re_sdk_comms::server] Hosting a SDK server over TCP at 0.0.0.0:9876. Connect with the Rerun logging SDK.
Error: winit EventLoopError: the requested operation is not supported by Winit -> the requested operation is not supported by Winit
[2024-05-29T03:27:26Z WARN  re_sdk_comms::buffered_client] Failed to send message after 3 attempts: Failed to connect to Rerun server at 127.0.0.1:9876: Connection refused (os error 111)
[2024-05-29T03:27:29Z WARN  re_sdk_comms::buffered_client] Dropping messages because tcp client has timed out.
[2024-05-29T03:27:29Z WARN  re_sdk_comms::buffered_client] Dropping messages because tcp client has timed out.
[2024-05-29T03:27:29Z WARN  re_sdk_comms::tcp_client] Tried to flush while TCP stream was still Pending. Data was possibly dropped.
100%|███████████████████████████████████████████████| 6/6 [00:00<00:00,  9.73it/s]

Btw, on the host I checked that the port is open:

~$ netstat -anp|grep 9876
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 0.0.0.0:9876            0.0.0.0:*               LISTEN      -
tcp6       0      0 :::9876                 :::*                    LISTEN

aliberts commented 4 months ago

For the first issue, you need to increase shared memory on your container with the --shm-size option of docker run, e.g. --shm-size "16gb" (16GB should be more than enough if your hardware allows it; if not, set it as high as you can).
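
As a sketch, combining this with the port publishing from earlier (same image tag and port as in your commands):

```bash
# Start the container with 16 GB of shared memory and the Rerun port published
docker run --shm-size "16gb" -p 9876:9876 -it lebot_docker_cpu:v1.0 /bin/bash

# Inside the container, check the shared memory actually available to the DataLoader workers
df -h /dev/shm
```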

For the visualization, you need to use your container as a distant machine hosting the rerun server; you can find the instructions on how to do that here (you'll still need to have rerun installed locally). Although we didn't intend for the Docker images to be used this way, as they are mainly for testing, it should work.
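
A rough sketch of that workflow, assuming visualize_dataset.py exposes the distant mode described in the linked instructions (the --mode distant and --ws-port flags and the rerun ws:// viewer invocation below are assumptions based on that documentation; adjust them to whatever the instructions actually specify, and port 9087 is just an example):

```bash
# On the host: publish the websocket port when starting the container
docker run --shm-size "16gb" -p 9087:9087 -it lebot_docker_cpu:v1.0 /bin/bash

# Inside the container: host the rerun server instead of spawning a local viewer
python lerobot/scripts/visualize_dataset.py \
    --repo-id lerobot/pusht \
    --episode-index 0 \
    --mode distant \
    --ws-port 9087

# Back on the host (with rerun installed locally): connect a viewer to the container
rerun ws://localhost:9087
```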

All-embracing commented 4 months ago

Thanks, Aliberts. With your new guidelines, I have successfully executed the script, and the distant mode using the container is now functioning properly. Thanks again for your assistance.