open-mmlab / mmaction

An open-source toolbox for action understanding based on PyTorch
https://open-mmlab.github.io/
Apache License 2.0

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. #150

Open ethanliuzhuo opened 4 years ago

ethanliuzhuo commented 4 years ago

I tried to train a model on the UCF101 dataset. My GPU is a GTX 2060S,

and I ran: ./tools/dist_train_recognizer.sh configs/TSN/ucf101/tsn_rgb_bninception.py 1 --validate

It does NOT work:

2020-04-03 15:40:26,326 - INFO - Distributed training: True
2020-04-03 15:40:26,506 - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

2020-04-03 15:40:29,750 - INFO - Start running, host: ethan@ethan-ideacentre-Y700-34ISH, work_dir: /media/ethan/D/Project/mmaction/work_dirs/tsn_2d_rgb_bninception_seg_3_f1s1_b32_g8
2020-04-03 15:40:29,750 - INFO - workflow: [('train', 1)], max: 80 epochs
Traceback (most recent call last):
  File "./tools/train_recognizer.py", line 90, in <module>
    main()
  File "./tools/train_recognizer.py", line 86, in main
    logger=logger)
  File "/media/ethan/D/Project/mmaction/mmaction/apis/train.py", line 58, in train_network
    _dist_train(model, dataset, cfg, validate=validate)
  File "/media/ethan/D/Project/mmaction/mmaction/apis/train.py", line 103, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/mmcv-0.4.2-py3.7-linux-x86_64.egg/mmcv/runner/runner.py", line 359, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/mmcv-0.4.2-py3.7-linux-x86_64.egg/mmcv/runner/runner.py", line 263, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/media/ethan/D/Project/mmaction/mmaction/apis/train.py", line 37, in batch_processor
    losses = model(*data)
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 392, in forward
    self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of forward). You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /opt/conda/conda-bld/pytorch_1556653114079/work/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f26c088bdc5 in /home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ff (0x7f26efe3dbbf in /home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0x6cb6c8 (0x7f26efe336c8 in /home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x12d07a (0x7f26ef89507a in /home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: _PyMethodDef_RawFastCallKeywords + 0x264 (0x5578a8593ab4 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #5: _PyCFunction_FastCallKeywords + 0x21 (0x5578a8593bd1 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #6: _PyEval_EvalFrameDefault + 0x5389 (0x5578a85faa39 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #7: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #8: _PyFunction_FastCallDict + 0x3ff (0x5578a85406ef in /home/ethan/anaconda3/envs/torch/bin/python)
frame #9: _PyObject_Call_Prepend + 0x63 (0x5578a855fa73 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #10: PyObject_Call + 0x6e (0x5578a8551fde in /home/ethan/anaconda3/envs/torch/bin/python)
frame #11: _PyEval_EvalFrameDefault + 0x1e9d (0x5578a85f754d in /home/ethan/anaconda3/envs/torch/bin/python)
frame #12: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #13: _PyFunction_FastCallDict + 0x3ff (0x5578a85406ef in /home/ethan/anaconda3/envs/torch/bin/python)
frame #14: _PyObject_Call_Prepend + 0x63 (0x5578a855fa73 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #15: <unknown function> + 0x17d27a (0x5578a85a727a in /home/ethan/anaconda3/envs/torch/bin/python)
frame #16: PyObject_Call + 0x6e (0x5578a8551fde in /home/ethan/anaconda3/envs/torch/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x1e9d (0x5578a85f754d in /home/ethan/anaconda3/envs/torch/bin/python)
frame #18: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #19: _PyFunction_FastCallDict + 0x3ff (0x5578a85406ef in /home/ethan/anaconda3/envs/torch/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x1e9d (0x5578a85f754d in /home/ethan/anaconda3/envs/torch/bin/python)
frame #21: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #22: _PyFunction_FastCallDict + 0x1d5 (0x5578a85404c5 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #23: _PyObject_Call_Prepend + 0x63 (0x5578a855fa73 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #24: PyObject_Call + 0x6e (0x5578a8551fde in /home/ethan/anaconda3/envs/torch/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x1e9d (0x5578a85f754d in /home/ethan/anaconda3/envs/torch/bin/python)
frame #26: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #27: _PyFunction_FastCallKeywords + 0x387 (0x5578a85932b7 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x690 (0x5578a85f5d40 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #29: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #30: _PyFunction_FastCallKeywords + 0x387 (0x5578a85932b7 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x14d4 (0x5578a85f6b84 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #32: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #33: _PyFunction_FastCallKeywords + 0x387 (0x5578a85932b7 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x14d4 (0x5578a85f6b84 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #35: _PyFunction_FastCallKeywords + 0xfb (0x5578a859302b in /home/ethan/anaconda3/envs/torch/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x416 (0x5578a85f5ac6 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #37: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #38: PyEval_EvalCodeEx + 0x44 (0x5578a85402b4 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #39: PyEval_EvalCode + 0x1c (0x5578a85402dc in /home/ethan/anaconda3/envs/torch/bin/python)
frame #40: <unknown function> + 0x22c664 (0x5578a8656664 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #41: PyRun_FileExFlags + 0xa1 (0x5578a8660a91 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #42: PyRun_SimpleFileExFlags + 0x1c3 (0x5578a8660c83 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #43: <unknown function> + 0x237db5 (0x5578a8661db5 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #44: _Py_UnixMain + 0x3c (0x5578a8661edc in /home/ethan/anaconda3/envs/torch/bin/python)
frame #45: __libc_start_main + 0xe7 (0x7f26fee80b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #46: <unknown function> + 0x1db3e0 (0x5578a86053e0 in /home/ethan/anaconda3/envs/torch/bin/python)

Traceback (most recent call last):
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/home/ethan/anaconda3/envs/torch/bin/python', '-u', './tools/train_recognizer.py', '--local_rank=0', 'configs/TSN/ucf101/tsn_rgb_bninception.py', '--launcher', 'pytorch']' returned non-zero exit status 1.

My PyTorch version is 1.1.0. Any solution?

saifsayed commented 4 years ago

This error occurs if you call the .sh script with a GPU-count argument of 1. I was able to train successfully on a single GPU by running train_recognizer.py directly (without the shell script), as shown below.
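
A hedged example of such a direct, non-distributed invocation (the config path is the one from the original post; train_recognizer.py accepts the same trailing flags, such as --validate, that dist_train_recognizer.sh forwards to it):

python ./tools/train_recognizer.py configs/TSN/ucf101/tsn_rgb_bninception.py --validate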

ziming-liu commented 4 years ago

I have the same problem. Did you fix this?

VJatla commented 4 years ago

Me too. I'm having the same problem with a GPU count of 2.

zt706 commented 3 years ago

The logs:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of forward). You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

fix: https://github.com/open-mmlab/mmaction/blob/c7e3b7c11fb94131be9b48a8e3d510589addc3ce/mmaction/apis/train.py#L73

from:
find_unused_parameters = cfg.get('find_unused_parameters', False)
to:
find_unused_parameters = cfg.get('find_unused_parameters', True)
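
For context, here is a minimal sketch of the pattern that line feeds into. It is not a verbatim copy of train.py (the real file uses mmcv's MMDistributedDataParallel wrapper, and cfg and model come from the surrounding training code, assumed here); it only illustrates what the changed default does:

    import torch
    from torch.nn.parallel import DistributedDataParallel

    # `cfg` is the loaded experiment config and `model` the recognizer,
    # both provided by the surrounding training code (assumed here).
    # With the patched default, a config without the key falls back to True.
    find_unused_parameters = cfg.get('find_unused_parameters', True)

    # find_unused_parameters=True makes DDP traverse the autograd graph
    # after each forward pass and mark parameters that did not contribute
    # to the output as ready, which is what the RuntimeError above asks for.
    model = DistributedDataParallel(
        model.cuda(),
        device_ids=[torch.cuda.current_device()],
        find_unused_parameters=find_unused_parameters)

Since cfg.get only falls back to the default when the key is absent, you should also be able to get the same effect without patching train.py by setting find_unused_parameters = True directly in your config file.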

alchemistwu commented 3 years ago


This one works for me!

yeruiqian commented 2 years ago


Good, thanks!