tanluren / yolov3-channel-and-layer-pruning

yolov3 / yolov4 channel pruning, layer pruning, and knowledge distillation
Apache License 2.0
1.5k stars · 446 forks

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. #43

Closed: kame-lqm closed this issue 4 years ago

kame-lqm commented 4 years ago

I hit the following error during sparsity training. What could be causing it? Thanks.

Error log:

Reading labels (14363 found, 0 missing, 9 empty for 14372 images): 100%|██████████| 14372/14372 [02:20<00:00, 102.01it/s]
Model Summary: 225 layers, 6.42767e+07 parameters, 6.42767e+07 gradients
Starting training for 120 epochs...

 Epoch   gpu_mem      GIoU       obj       cls     total      soft    rratio   targets  img_size

0%|          | 0/450 [00:00<?, ?it/s]
learning rate: 1e-06
  0/119     6.96G      1.54      1.88     0.968      4.38         0         0       101       416: 100%|██████████| 450/450 [06:35<00:00, 1.14it/s]
               Class    Images   Targets         P         R       mAP        F1: 100%|██████████| 113/113 [02:52<00:00, 1.53s/it]
                 all  3.59e+03  1.01e+05     0.322     0.453     0.365     0.363

 Epoch   gpu_mem      GIoU       obj       cls     total      soft    rratio   targets  img_size

0%|          | 0/450 [00:00<?, ?it/s]
learning rate: 0.0011625
Traceback (most recent call last):
  File "train.py", line 542, in <module>
    train()  # train normally
  File "train.py", line 348, in train
    pred = model(imgs)
  File "/devdata/liqm/Tools/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/devdata/liqm/Tools/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 459, in forward
    self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /opt/conda/conda-bld/pytorch_1573049387353/work/torch/csrc/distributed/c10d/reducer.cpp:518)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f0d6836f687 in /devdata/liqm/Tools/miniconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x7b7 (0x7f0d6de68667 in /devdata/liqm/Tools/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0x7cfca1 (0x7f0d6de56ca1 in /devdata/liqm/Tools/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
(frames #3–#30 omitted: libtorch_python, CPython interpreter, and libc internals)
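The message itself points at the cause: some registered parameters never receive gradients, so DDP's reducer is still waiting on them when the next iteration starts. Below is a minimal, self-contained sketch of that failure mode, assuming a single-process gloo setup for illustration; the TwoBranch module and all shapes are invented for this sketch and are not taken from this repo's train.py:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

# Single-process process group so DDP can be constructed locally
# (address/port are arbitrary local values for this sketch).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class TwoBranch(nn.Module):
    """Toy module: one branch's parameters never contribute to the loss."""
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(8, 8)
        self.unused = nn.Linear(8, 8)  # registered, but skipped in forward

    def forward(self, x):
        return self.used(x)  # self.unused produces no gradients

model = nn.parallel.DistributedDataParallel(TwoBranch())
for step in range(2):
    loss = model(torch.randn(4, 8)).sum()
    loss.backward()  # DDP's reducer waits for grads that never arrive
    model.zero_grad()
# Depending on the PyTorch version, the RuntimeError above surfaces
# during the first backward or at the start of the second forward.
```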

kame-lqm commented 4 years ago

Solved it. The fix is to change one line in train.py. Testing confirms that both sparsity training and pruning then work as expected: pruning at a 0.5 ratio loses essentially no mAP, and even gains a tiny bit. Change

model = torch.nn.parallel.DistributedDataParallel(model)

to

model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
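For context, here is a runnable sketch of that fix applied to the same toy setup as the reproduction above (again a single-process gloo group and an invented TwoBranch module, not the repo's actual train.py). With find_unused_parameters=True, DDP traverses the autograd graph after each forward pass and marks parameters that produced no gradient as ready, instead of waiting for gradients that never arrive:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")  # arbitrary local port
dist.init_process_group("gloo", rank=0, world_size=1)

class TwoBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(8, 8)
        self.unused = nn.Linear(8, 8)  # still never used in forward

    def forward(self, x):
        return self.used(x)

# The one-line fix from the comment above: let DDP detect unused params.
model = nn.parallel.DistributedDataParallel(
    TwoBranch(), find_unused_parameters=True
)
for step in range(2):
    loss = model(torch.randn(4, 8)).sum()
    loss.backward()  # now completes: unused parameters are marked ready
    model.zero_grad()
print("two iterations finished without the reduction error")
dist.destroy_process_group()
```

Note the flag is not free: DDP has to walk the autograd graph every iteration to find the unused parameters, so it costs some throughput. The cleaner long-term fix is to make every registered parameter contribute to the loss, but the one-line change above is the practical workaround here.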