open-mmlab / mmcv

OpenMMLab Computer Vision Foundation
https://mmcv.readthedocs.io/en/latest/
Apache License 2.0
5.91k stars 1.65k forks

[Bug] MMDistributedDataParallel: multi-GPU data loading fails, single-GPU works fine #3020

Open cqray1990 opened 10 months ago

cqray1990 commented 10 months ago

Prerequisite

Environment

mmcv 1.4

Reproduces the problem - code sample

When the dataset is fairly large, loading data on multiple GPUs fails with the error listed under the error message heading below. Single-GPU loading and training work without problems. The error disappears when part of the data is removed.

Reproduces the problem - command or script

python -m torch.distributed.launch --nproc_per_node 2 train.py --launcher pytorch

Reproduces the problem - error message

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1436 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 1435) of binary: anaconda3/envs/ocr_env/bin/python
Traceback (most recent call last):
  File "anaconda3/envs/ocr_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "anaconda3/envs/ocr_env/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "anaconda3/envs/ocr_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "anaconda3/envs/ocr_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "anaconda3/envs/ocr_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "anaconda3/envs/ocr_env/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "anaconda3/envs/ocr_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "anaconda3/envs/ocr_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
.MMDAVAR/tools/train.py FAILED
-----------------------------------------------------
Failures:
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-01-21_21:54:57
  host : ocr-PR2715P2-MS-S212
  rank : 0 (local_rank: 0)
  exitcode : -9 (pid: 1435)
  error_file:
  traceback : Signal 9 (SIGKILL) received by PID 1435

Additional information

No response
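The report shows rank 0 exiting with code -9, i.e. it received SIGKILL, and the failure depends on dataset size. On Linux a SIGKILL like this is often (though not always) sent by the kernel's out-of-memory killer; the issue does not confirm that this is the cause here. One way to check is to log each rank's peak host memory around dataset construction. The sketch below is a minimal, standard-library-only illustration; `peak_rss_mib` and `log_peak_memory` are hypothetical helper names, not part of mmcv or PyTorch, and the code assumes the launcher exports the `RANK` environment variable.

```python
import os
import resource
import sys


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_peak_memory(tag: str) -> str:
    """Print and return a one-line peak-memory report for this rank."""
    # Assumption: the launcher sets RANK for each worker; fall back to "0".
    rank = os.environ.get("RANK", "0")
    line = f"[rank {rank}] {tag}: peak RSS = {peak_rss_mib():.1f} MiB"
    print(line, flush=True)
    return line


if __name__ == "__main__":
    log_peak_memory("before dataset build")
    # Stand-in for loading annotations; replace with the real dataset build.
    data = [bytes(1024) for _ in range(100_000)]
    log_peak_memory("after dataset build")
```

Calling `log_peak_memory` before and after the dataset is built in `train.py`, once with one process and once with two, would show whether per-rank memory grows with the amount of data. Note that this measures only the calling process; DataLoader worker processes are separate and would need the same call inside them (for example from a `worker_init_fn`).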