open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

_pickle.UnpicklingError: pickle data was truncated #5301

Closed YexiongLin closed 3 years ago

YexiongLin commented 3 years ago

I can train the model with python ./tools/train.py, but when I use ./tools/dist_train.sh it fails with:

2021-06-06 18:36:09,350 - mmdet - INFO - Start running, host: lyx@lyx-MS-7A71, work_dir: /data/detection/mmdetection/work_dirs/yolov3_d53_fp16_mstrain-608_273e_coco
2021-06-06 18:36:09,350 - mmdet - INFO - workflow: [('train', 1)], max: 273 epochs
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/lyx/anaconda3/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/lyx/anaconda3/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
  File "/home/lyx/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lyx/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lyx/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/lyx/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/lyx/anaconda3/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/yolo/yolov3_d53_fp16_mstrain-608_273e_coco.py', '--launcher', 'pytorch']' died with <Signals.SIGKILL: 9>.
(base) lyx@lyx-MS-7A71:mmdetection$ /home/lyx/anaconda3/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

My versions are mmdetection 2.13.0, mmcv 1.3.5, and PyTorch 1.7.1.

RangiLyu commented 3 years ago

Your training process was killed (SIGKILL). There are many possible reasons; usually it is because you don't have enough memory, so the OS out-of-memory killer terminates the training processes.
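If memory is the bottleneck, one common mitigation is to shrink the per-GPU batch size and the number of dataloader workers, since every extra worker is a separate process that adds to RAM use. Below is a minimal sketch of a config override, assuming the standard mmdetection 2.x `data` layout; the file name and the concrete values are illustrative, not taken from this issue:

```python
# yolov3_lowmem.py -- hypothetical low-memory variant of the config in this issue
_base_ = './yolov3_d53_fp16_mstrain-608_273e_coco.py'

data = dict(
    samples_per_gpu=4,   # fewer images per GPU lowers peak RAM/VRAM per process
    workers_per_gpu=1,   # fewer dataloader workers means fewer extra processes in RAM
)
```

Recent 2.x versions of tools/train.py also accept such overrides on the command line, e.g. `--cfg-options data.samples_per_gpu=4 data.workers_per_gpu=1` (check `python tools/train.py --help` for your version), if you prefer not to create a new config file.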

jh01231230 commented 1 year ago

> Your training process was killed (SIGKILL). There are many possible reasons; usually it is because you don't have enough memory, so the OS out-of-memory killer terminates the training processes.

This error also shows up when training in multi-process mode under Windows. Windows does not support fork-based multiprocessing, so Python falls back to spawn, which has to pickle the worker state and send it to the child processes; that serialize/deserialize step can fail with a truncated-pickle error. There is no known complete fix yet; the usual compromise is to disable multiprocessing in the dataloader (a sketch follows below). Here is a thread for reference: https://github.com/pytorch/pytorch/issues/69611
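As a sketch of that workaround, again assuming the standard mmdetection 2.x `data` config layout (the override below is illustrative, not taken from this issue): setting the worker count to zero makes the DataLoader load batches in the main process, so nothing has to be pickled and shipped to spawned workers.

```python
# Hypothetical override in your config file: disable multi-process data loading.
data = dict(
    workers_per_gpu=0,  # 0 workers -> batches are loaded in the main process,
                        # avoiding the spawn + pickle round trip that fails here
)
```

This trades data-loading throughput for robustness, so it is a compromise rather than a fix, as noted above.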