hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0

When running on multiple GPUs, setting num_workers to be greater than 0 can cause RuntimeError: DataLoader worker (pid 742) is killed by signal: Killed. #431

Closed · yiiizuo closed this issue 3 months ago

yiiizuo commented 3 months ago

When running on multiple GPUs, setting num_workers to a value greater than 0 results in the following error:

rank7:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
rank7:     data = self._data_queue.get(timeout=timeout)
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/queue.py", line 180, in get
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/threading.py", line 324, in wait
rank7:     gotit = waiter.acquire(True, timeout)
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
rank7: RuntimeError: DataLoader worker (pid 742) is killed by signal: Killed.

rank7: The above exception was the direct cause of the following exception:

rank7: Traceback (most recent call last):
rank7:   File "/zuoyi/T2V/Code/Open-Sora/scripts/train.py", line 408, in <module>
rank7:   File "/zuoyi/T2V/Code/Open-Sora/scripts/train.py", line 314, in main
rank7:     for step, batch in pbar:
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/tqdm/std.py", line 1169, in __iter__
rank7:     for obj in iterable:
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
rank7:     data = self._next_data()
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
rank7:     idx, data = self._get_data()
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1285, in _get_data
rank7:     success, data = self._try_get_data()
rank7:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
rank7:     raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
rank7: RuntimeError: DataLoader worker (pid(s) 742) exited unexpectedly
W0608 16:09:17.624000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 78 closing signal SIGTERM
W0608 16:09:17.628000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 79 closing signal SIGTERM
W0608 16:09:17.633000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 80 closing signal SIGTERM
W0608 16:09:17.636000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 81 closing signal SIGTERM
W0608 16:09:17.640000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 82 closing signal SIGTERM
W0608 16:09:17.643000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 83 closing signal SIGTERM
W0608 16:09:17.646000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 84 closing signal SIGTERM
E0608 16:09:22.054000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 7 (pid: 85) of binary: /opt/conda/envs/opensora/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/opensora/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/train.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-06-08_16:09:17
  host       : pytorch-5c5a1b99-8zzgx
  rank       : 7 (local_rank: 7)
  exitcode   : 1 (pid: 85)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
chehx commented 3 months ago

I'm running into the same problem; it looks like the memory usage is too high.
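
(Editor's note: to confirm that the host's OOM killer is what is terminating the workers, one rough check is to log the resident memory of the training process and its DataLoader worker children. The sketch below is only illustrative and assumes `psutil` is installed; it is not part of Open-Sora.)

```python
# Illustrative diagnostic only (assumes psutil is installed; not Open-Sora code).
# Call log_memory() periodically from the training loop to watch RSS growth of
# the main process and its DataLoader worker subprocesses.
import os

import psutil


def log_memory(tag: str = "") -> None:
    proc = psutil.Process(os.getpid())
    main_gb = proc.memory_info().rss / 1e9
    worker_gb = sum(c.memory_info().rss for c in proc.children(recursive=True)) / 1e9
    print(f"[mem{tag}] main={main_gb:.1f} GB, workers={worker_gb:.1f} GB")
```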

JThh commented 3 months ago

This should indeed be an issue with excessive memory usage, as suggested by @chehx.

In your main script, can you try adding the line `mp.set_start_method('forkserver', force=True)` to see if it has any effect? Otherwise, you may have to stick with `num_workers=0`.
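
(Editor's note: a minimal sketch of where such a call would typically go, assuming a generic training entry point; the `train()` function below is a placeholder, not Open-Sora's actual `scripts/train.py`.)

```python
# Hedged sketch of the suggested workaround, not the project's actual code.
import torch.multiprocessing as mp


def train() -> None:
    # Placeholder: build the dataset, DataLoader (num_workers > 0),
    # model, optimizer, and run the training loop here.
    pass


if __name__ == "__main__":
    # Switch the worker start method before any DataLoader workers are created.
    # 'forkserver' (or 'spawn') starts workers from a small, clean process
    # instead of fork()-copying the parent's entire address space, which can
    # reduce the memory footprint that triggers the host OOM killer.
    mp.set_start_method("forkserver", force=True)
    train()
```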

yiiizuo commented 3 months ago

Thank you, I have already resolved it

chehx commented 3 months ago

> Thank you, I have already resolved it

Which one worked: `mp.set_start_method('forkserver', force=True)` or `num_workers=0`?

tonney007 commented 3 months ago

> Thank you, I have already resolved it

I ran into the same problem. How did you solve it?

tonney007 commented 3 months ago

> Thank you, I have already resolved it
>
> Which one worked: `mp.set_start_method('forkserver', force=True)` or `num_workers=0`?

For me, only `num_workers=0` works.
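
(Editor's note: for reference, the `num_workers=0` fallback is just the standard `torch.utils.data.DataLoader` argument. The sketch below is self-contained with a toy dataset standing in for the project's real video dataset.)

```python
# Fallback sketch: load data in the main process. num_workers=0 disables worker
# subprocesses entirely, trading data-loading throughput for lower host-memory
# pressure (no forked copies for the OOM killer to terminate).
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset; in practice this would be the project's video dataset.
dataset = TensorDataset(torch.randn(16, 3, 8, 8))

loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    num_workers=0,   # no worker processes
    pin_memory=True,
)

for (batch,) in loader:
    pass  # the training step would go here
```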