Closed (YexiongLin closed this 3 years ago)
Your training process was killed. There are many possible causes, but the most common one is running out of memory.
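The `pickle data was truncated` message in the reported traceback fits this explanation: if a process is killed while it is still sending pickled state to a child process, the child only receives part of the stream. The sketch below is an illustration only, not what mmdetection or PyTorch do internally. It uses a made-up `payload` and cuts the byte string in half to imitate a writer that died mid-transfer.

```python
import pickle

# Made-up payload standing in for the state a parent process would send.
payload = {"weights": list(range(1000))}
data = pickle.dumps(payload)

# Imitate a writer that was killed mid-transfer: the reader only
# receives the first half of the byte stream.
truncated = data[: len(data) // 2]

try:
    pickle.loads(truncated)
    error = None
except (pickle.UnpicklingError, EOFError) as exc:
    # CPython reports an incomplete stream with one of these exceptions.
    error = exc

print(type(error).__name__, error)
```

Seen this way, the unpickling error is likely a symptom of the kill rather than its cause, so the memory usage of the training process is the first thing to check.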
This issue happens when training in multiprocess mode on Windows. Windows does not support fork-based multiprocessing, so the serialization and deserialization of process state with pickle may fail. There is no known fix for this issue yet; the only workaround is to disable multiprocessing. Here is a thread for reference: https://github.com/pytorch/pytorch/issues/69611
I can train the model with python ./tools/train.py, but when I use ./tools/dist_train.sh, it raises:
2021-06-06 18:36:09,350 - mmdet - INFO - Start running, host: lyx@lyx-MS-7A71, work_dir: /data/detection/mmdetection/work_dirs/yolov3_d53_fp16_mstrain-608_273e_coco
2021-06-06 18:36:09,350 - mmdet - INFO - workflow: [('train', 1)], max: 273 epochs
Traceback (most recent call last):
File "", line 1, in
File "/home/lyx/anaconda3/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/home/lyx/anaconda3/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
File "/home/lyx/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/lyx/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/lyx/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in
main()
File "/home/lyx/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/lyx/anaconda3/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/yolo/yolov3_d53_fp16_mstrain-608_273e_coco.py', '--launcher', 'pytorch']' died with <Signals.SIGKILL: 9>.
(base) lyx@lyx-MS-7A71:mmdetection$ /home/lyx/anaconda3/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
My mmdetection version is 2.13.0, mmcv: 1.3.5, pytorch: 1.7.1