Shared memory manager connection timeout during training

kellyzxiaowei commented 2 months ago

System Info

I'm getting a "Shared memory manager connection has timed out" error while training my model. It occurs during data loading, at the point where the training loop requests the next batch. Here are the details:

- Environment: macOS, using the MPS (Metal Performance Shaders) device, M3BP
- Python version: 3.10
- Framework: PyTorch
- Error message:

INFO 2024-09-07 14:01:13 ts/train.py:192 step:7K smpl:29K ep:66 epch:13.16 loss:0.567 grdn:41.324 lr:5.0e-06 updt_s:0.233 data_s:0.001
INFO 2024-09-07 14:01:36 ts/train.py:192 step:7K smpl:29K ep:67 epch:13.35 loss:0.577 grdn:49.279 lr:5.0e-06 updt_s:0.230 data_s:0.001
INFO 2024-09-07 14:02:00 ts/train.py:192 step:7K smpl:30K ep:68 epch:13.53 loss:0.568 grdn:42.813 lr:5.0e-06 updt_s:0.235 data_s:0.001
INFO 2024-09-07 14:02:23 ts/train.py:192 step:8K smpl:30K ep:69 epch:13.71 loss:0.759 grdn:50.854 lr:5.0e-06 updt_s:0.230 data_s:0.001
Error executing job with overrides: ['dataset_repo_id=data/koch_test', 'policy=act_koch_real', 'env=koch_real', 'device=mps', 'wandb.enable=false']
Traceback (most recent call last):
  File "/Users/zxw/AITOOL/lerobot/lerobot/scripts/train.py", line 652, in train_cli
    train(
  File "/Users/zxw/AITOOL/lerobot/lerobot/scripts/train.py", line 424, in train
    batch = next(dl_iter)
  File "/Users/zxw/AITOOL/lerobot/lerobot/common/datasets/utils.py", line 398, in cycle
    yield next(iterator)
  File "/Users/zxw/miniconda3/envs/lerobot/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/Users/zxw/miniconda3/envs/lerobot/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/Users/zxw/miniconda3/envs/lerobot/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/Users/zxw/miniconda3/envs/lerobot/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/Users/zxw/miniconda3/envs/lerobot/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/Users/zxw/miniconda3/envs/lerobot/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 514, in rebuild_storage_filename
    storage = torch.UntypedStorage._new_shared_filename_cpu(manager, handle, size)
RuntimeError: Shared memory manager connection has timed out

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Information

[X] One of the scripts in the examples/ folder of LeRobot
[X] My own task or dataset (give details below)

Reproduction

(No details provided.)

Expected behavior

(No details provided.)

aliberts commented 2 months ago

Hi there, how many workers did you use? Do you have W&B plots showing memory usage, by any chance?
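For anyone hitting the same traceback: the failing call, `rebuild_storage_filename`, runs when the main process unpickles a tensor that a DataLoader worker process sent back through shared memory, which is why the worker count matters here. A quick way to rule that path in or out is a run with `num_workers=0`, where batches are built in the main process and nothing crosses a process boundary. Below is a minimal sketch of choosing the DataLoader arguments for such a debug run. `loader_kwargs` is a hypothetical helper and not part of LeRobot, and the batch size is a placeholder.

```python
def loader_kwargs(num_workers: int, batch_size: int = 8) -> dict:
    """Build keyword arguments for torch.utils.data.DataLoader.

    Hypothetical helper for illustration; batch_size is a placeholder value.
    """
    kwargs = {"batch_size": batch_size, "num_workers": num_workers, "shuffle": True}
    if num_workers > 0:
        # These two options are only accepted when worker processes exist;
        # DataLoader raises ValueError if they are set with num_workers=0.
        kwargs["persistent_workers"] = True
        kwargs["prefetch_factor"] = 2
    return kwargs


# Debug run: all loading happens in the main process, so no tensors are
# passed between processes through shared memory.
debug_kwargs = loader_kwargs(num_workers=0)
print(debug_kwargs)

# Assumed usage, where `dataset` is your own torch Dataset:
#   loader = torch.utils.data.DataLoader(dataset, **debug_kwargs)
```

If the error disappears with `num_workers=0`, the problem is likely in the worker-to-main-process transfer rather than in the dataset itself. Training will be slower in that mode, so it is a diagnostic rather than a fix.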

CaesarrWANG commented 2 months ago

Have you solved this problem? I encountered the same error.

kellyzxiaowei commented 1 month ago

> Have you solved this problem? I encountered the same error.

I deleted the conda environment and used a Python virtual environment instead, and training then worked properly.
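In case it helps others compare setups, here is a small sketch for confirming which kind of environment the training script actually runs in. It does not explain why the conda environment failed. It only reports whether the interpreter comes from a venv-style environment, an activated conda environment, or neither. `describe_environment` is a hypothetical helper; the checks rely on `sys.prefix`, `sys.base_prefix`, and the `CONDA_PREFIX` variable that conda sets on activation.

```python
import os
import sys


def describe_environment(prefix=None, base_prefix=None, environ=None) -> str:
    """Return "venv", "conda", or "system" for the given interpreter prefixes.

    Hypothetical helper; the arguments default to the running interpreter.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    environ = os.environ if environ is None else environ

    # In a venv (or a recent virtualenv), sys.prefix points at the
    # environment while sys.base_prefix points at the base installation.
    if prefix != base_prefix:
        return "venv"

    # An activated conda environment exports CONDA_PREFIX; compare it with
    # the interpreter's own prefix to make sure this Python belongs to it.
    conda_prefix = environ.get("CONDA_PREFIX", "")
    if conda_prefix and os.path.normpath(conda_prefix) == os.path.normpath(prefix):
        return "conda"

    return "system"


print(describe_environment(), sys.executable)
```

Running this with the same interpreter used for training shows whether the switch away from conda actually took effect.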

kellyzxiaowei commented 1 month ago

> Hi there, how many workers did you use? Do you have W&B plots showing memory usage, by any chance?

It is now working properly.

huggingface / lerobot

Shared memory manager connection timeout during training #421
