hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0

When running on multiple GPUs, setting num_workers to be greater than 0 can cause RuntimeError: DataLoader worker (pid 742) is killed by signal: Killed. #431

Closed · yiiizuo closed this issue 3 months ago

yiiizuo commented 3 months ago

When running on multiple GPUs, setting num_workers to a value greater than 0 results in the following error:

rank7:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
rank7:     data = self._data_queue.get(timeout=timeout)
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/queue.py", line 180, in get
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/threading.py", line 324, in wait
rank7:     gotit = waiter.acquire(True, timeout)
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
rank7: RuntimeError: DataLoader worker (pid 742) is killed by signal: Killed.

rank7: The above exception was the direct cause of the following exception:

rank7: Traceback (most recent call last):
rank7:   File "/zuoyi/T2V/Code/Open-Sora/scripts/train.py", line 408, in <module>
rank7:   File "/zuoyi/T2V/Code/Open-Sora/scripts/train.py", line 314, in main
rank7:     for step, batch in pbar:
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/tqdm/std.py", line 1169, in __iter__
rank7:     for obj in iterable:
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
rank7:     data = self._next_data()
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
rank7:     idx, data = self._get_data()
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1285, in _get_data
rank7:     success, data = self._try_get_data()
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
rank7:     raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
rank7: RuntimeError: DataLoader worker (pid(s) 742) exited unexpectedly
W0608 16:09:17.624000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 78 closing signal SIGTERM
W0608 16:09:17.628000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 79 closing signal SIGTERM
W0608 16:09:17.633000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 80 closing signal SIGTERM
W0608 16:09:17.636000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 81 closing signal SIGTERM
W0608 16:09:17.640000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 82 closing signal SIGTERM
W0608 16:09:17.643000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 83 closing signal SIGTERM
W0608 16:09:17.646000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 84 closing signal SIGTERM
E0608 16:09:22.054000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 7 (pid: 85) of binary: /opt/conda/envs/opensora/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/opensora/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/train.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-06-08_16:09:17
  host       : pytorch-5c5a1b99-8zzgx
  rank       : 7 (local_rank: 7)
  exitcode   : 1 (pid: 85)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
chehx commented 3 months ago

I'm running into the same problem; it looks like the memory usage is too high.
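
(Editor's note: to confirm that the host's OOM killer is what is terminating the workers, one rough check is to log the resident memory of the training process and its DataLoader worker children. The sketch below is only illustrative and assumes `psutil` is installed; it is not part of Open-Sora.)

```python
# Illustrative diagnostic only (assumes psutil is installed; not Open-Sora code).
# Call log_memory() periodically from the training loop to watch RSS growth of
# the main process and its DataLoader worker subprocesses.
import os

import psutil


def log_memory(tag: str = "") -> None:
    proc = psutil.Process(os.getpid())
    main_gb = proc.memory_info().rss / 1e9
    worker_gb = sum(c.memory_info().rss for c in proc.children(recursive=True)) / 1e9
    print(f"[mem{tag}] main={main_gb:.1f} GB, workers={worker_gb:.1f} GB")
```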

JThh commented 3 months ago

This should indeed be an issue with excessive memory usage, as suggested by @chehx.

In your main script, can you try adding the line `mp.set_start_method('forkserver', force=True)` to see if it has any effect? Otherwise, you may have to stick with `num_workers=0`.
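
(Editor's note: a minimal sketch of where such a call would typically go, assuming a generic training entry point; the `train()` function below is a placeholder, not Open-Sora's actual `scripts/train.py`.)

```python
# Hedged sketch of the suggested workaround, not the project's actual code.
import torch.multiprocessing as mp


def train() -> None:
    # Placeholder: build the dataset, DataLoader (num_workers > 0),
    # model, optimizer, and run the training loop here.
    pass


if __name__ == "__main__":
    # Switch the worker start method before any DataLoader workers are created.
    # 'forkserver' (or 'spawn') starts workers from a small, clean process
    # instead of fork()-copying the parent's entire address space, which can
    # reduce the memory footprint that triggers the host OOM killer.
    mp.set_start_method("forkserver", force=True)
    train()
```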

yiiizuo commented 3 months ago

Thank you, I have already resolved it

chehx commented 3 months ago

> Thank you, I have already resolved it

Which one worked: `mp.set_start_method('forkserver', force=True)` or `num_workers=0`?

tonney007 commented 3 months ago

> Thank you, I have already resolved it

I ran into the same problem. How did you solve it?

tonney007 commented 3 months ago

> Thank you, I have already resolved it
>
> Which one worked: `mp.set_start_method('forkserver', force=True)` or `num_workers=0`?

For me, only `num_workers=0` works.
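
(Editor's note: for reference, the `num_workers=0` fallback is just the standard `torch.utils.data.DataLoader` argument. The sketch below is self-contained with a toy dataset standing in for the project's real video dataset.)

```python
# Fallback sketch: load data in the main process. num_workers=0 disables worker
# subprocesses entirely, trading data-loading throughput for lower host-memory
# pressure (no forked copies for the OOM killer to terminate).
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset; in practice this would be the project's video dataset.
dataset = TensorDataset(torch.randn(16, 3, 8, 8))

loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    num_workers=0,   # no worker processes
    pin_memory=True,
)

for (batch,) in loader:
    pass  # the training step would go here
```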