Closed · yiiizuo closed this issue 3 months ago
I'm hitting the same problem; it looks like the memory usage is too large.
This should indeed be an issue with excessive memory usage, as suggested by @chehx.
In your main script, can you try adding the line mp.set_start_method('forkserver', force=True) to see if it has any effect? Otherwise, you may have to stick with num_workers=0.
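In case it helps, here is a minimal sketch of where that call might go, assuming the training script has a standard main() entry point guarded by if __name__ == "__main__" (the structure below is illustrative, not the actual Open-Sora scripts/train.py):

```python
import torch.multiprocessing as mp


def main():
    # Build the dataset, DataLoader (num_workers > 0), model, and run the training loop.
    ...


if __name__ == "__main__":
    # 'forkserver' starts DataLoader workers from a small, clean server process
    # instead of fork()-ing the full parent, so each worker does not carry a
    # copy-on-write image of the parent's (already large) memory.
    mp.set_start_method("forkserver", force=True)
    main()
```

The call has to run before any DataLoader workers are spawned, which is why it goes at the top of the __main__ block.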
Thank you, I have already resolved it
Thank you, I have already resolved it
Which one worked: mp.set_start_method('forkserver', force=True) or num_workers = 0?
Thank you, I have already resolved it
I ran into the same problem. How did you solve it?
Thank you, I have already resolved it
Which one worked: mp.set_start_method('forkserver', force=True) or num_workers = 0?
For me, only num_workers = 0 works.
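For anyone else landing here, this is a minimal sketch of the num_workers=0 fallback; the TensorDataset below is just a stand-in for the real video/text dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; in the real script this would be the training dataset.
dataset = TensorDataset(torch.randn(16, 3, 32, 32))

# num_workers=0 loads every batch in the main process: slower, but there are no
# worker subprocesses for the OOM killer to terminate.
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)

for (batch,) in dataloader:
    pass  # training step would go here
```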
When running on multiple GPUs, setting num_workers greater than 0 can result in the following errors:
[rank7]:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
[rank7]:     data = self._data_queue.get(timeout=timeout)
[rank7]:   File "/opt/conda/envs/opensora/lib/python3.10/queue.py", line 180, in get
[rank7]:   File "/opt/conda/envs/opensora/lib/python3.10/threading.py", line 324, in wait
[rank7]:     gotit = waiter.acquire(True, timeout)
[rank7]:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
[rank7]: RuntimeError: DataLoader worker (pid 742) is killed by signal: Killed.

[rank7]: The above exception was the direct cause of the following exception:

[rank7]: Traceback (most recent call last):
[rank7]:   File "/zuoyi/T2V/Code/Open-Sora/scripts/train.py", line 408, in <module>
[rank7]:   File "/zuoyi/T2V/Code/Open-Sora/scripts/train.py", line 314, in main
[rank7]:     for step, batch in pbar:
[rank7]:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/tqdm/std.py", line 1169, in __iter__
[rank7]:     for obj in iterable:
[rank7]:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
[rank7]:     data = self._next_data()
[rank7]:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
[rank7]:     idx, data = self._get_data()
[rank7]:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1285, in _get_data
[rank7]:     success, data = self._try_get_data()
[rank7]:   File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
[rank7]:     raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
[rank7]: RuntimeError: DataLoader worker (pid(s) 742) exited unexpectedly
W0608 16:09:17.624000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 78 closing signal SIGTERM
W0608 16:09:17.628000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 79 closing signal SIGTERM
W0608 16:09:17.633000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 80 closing signal SIGTERM
W0608 16:09:17.636000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 81 closing signal SIGTERM
W0608 16:09:17.640000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 82 closing signal SIGTERM
W0608 16:09:17.643000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 83 closing signal SIGTERM
W0608 16:09:17.646000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 84 closing signal SIGTERM
E0608 16:09:22.054000 139721300797248 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 7 (pid: 85) of binary: /opt/conda/envs/opensora/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/opensora/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
scripts/train.py FAILED
Failures: