ethanliuzhuo opened 4 years ago
This error occurs if you run the .sh script with a GPU-count argument of 1. I was able to train successfully on a single GPU by running train_recognizer.py directly (without the shell script).
I have the same problem. Did you fix this?
Me too. I have the same problem with a GPU count of 2.
The logs:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of forward). You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable)
The fix: change find_unused_parameters = cfg.get('find_unused_parameters', False) to find_unused_parameters = cfg.get('find_unused_parameters', True)
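To see why this one-character-style change matters, here is a minimal sketch of the lookup semantics. A plain dict stands in for the mmcv Config object (both fall back to the given default when the key is absent); `resolve_flag` is a hypothetical helper, not mmaction code:

```python
# Illustrates the cfg.get default change: a plain dict stands in for
# the mmcv Config object. resolve_flag is a made-up name for this sketch.

def resolve_flag(cfg, default):
    # Mirrors: find_unused_parameters = cfg.get('find_unused_parameters', default)
    return cfg.get('find_unused_parameters', default)

cfg_without_flag = {}                             # config file does not set the flag
cfg_with_flag = {'find_unused_parameters': True}  # config sets it explicitly

print(resolve_flag(cfg_without_flag, False))  # old default  -> False, DDP raises on unused params
print(resolve_flag(cfg_without_flag, True))   # new default  -> True, DDP tolerates unused params
print(resolve_flag(cfg_with_flag, False))     # an explicit config value wins either way -> True
```

So instead of patching train.py, it should also be possible to set find_unused_parameters = True in your config file, since an explicit value overrides the default.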
This one works for me!
Good, thanks!
I'm trying to train a model on the UCF101 dataset. My GPU is an RTX 2060S. I ran:
./tools/dist_train_recognizer.sh configs/TSN/ucf101/tsn_rgb_bninception.py 1 --validate
but it does NOT work.
2020-04-03 15:40:26,326 - INFO - Distributed training: True
2020-04-03 15:40:26,506 - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc.weight, fc.bias
2020-04-03 15:40:29,750 - INFO - Start running, host: ethan@ethan-ideacentre-Y700-34ISH, work_dir: /media/ethan/D/Project/mmaction/work_dirs/tsn_2d_rgb_bninception_seg_3_f1s1_b32_g8
2020-04-03 15:40:29,750 - INFO - workflow: [('train', 1)], max: 80 epochs
Traceback (most recent call last):
  File "./tools/train_recognizer.py", line 90, in <module>
    main()
  File "./tools/train_recognizer.py", line 86, in main
    logger=logger)
  File "/media/ethan/D/Project/mmaction/mmaction/apis/train.py", line 58, in train_network
    _dist_train(model, dataset, cfg, validate=validate)
  File "/media/ethan/D/Project/mmaction/mmaction/apis/train.py", line 103, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/mmcv-0.4.2-py3.7-linux-x86_64.egg/mmcv/runner/runner.py", line 359, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/mmcv-0.4.2-py3.7-linux-x86_64.egg/mmcv/runner/runner.py", line 263, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/media/ethan/D/Project/mmaction/mmaction/apis/train.py", line 37, in batch_processor
    losses = model(**data)
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 392, in forward
    self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of forward). You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /opt/conda/conda-bld/pytorch_1556653114079/work/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f26c088bdc5 in /home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ff (0x7f26efe3dbbf in /home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #2: + 0x6cb6c8 (0x7f26efe336c8 in /home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: + 0x12d07a (0x7f26ef89507a in /home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: _PyMethodDef_RawFastCallKeywords + 0x264 (0x5578a8593ab4 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #5: _PyCFunction_FastCallKeywords + 0x21 (0x5578a8593bd1 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #6: _PyEval_EvalFrameDefault + 0x5389 (0x5578a85faa39 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #7: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #8: _PyFunction_FastCallDict + 0x3ff (0x5578a85406ef in /home/ethan/anaconda3/envs/torch/bin/python)
frame #9: _PyObject_Call_Prepend + 0x63 (0x5578a855fa73 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #10: PyObject_Call + 0x6e (0x5578a8551fde in /home/ethan/anaconda3/envs/torch/bin/python)
frame #11: _PyEval_EvalFrameDefault + 0x1e9d (0x5578a85f754d in /home/ethan/anaconda3/envs/torch/bin/python)
frame #12: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #13: _PyFunction_FastCallDict + 0x3ff (0x5578a85406ef in /home/ethan/anaconda3/envs/torch/bin/python)
frame #14: _PyObject_Call_Prepend + 0x63 (0x5578a855fa73 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #15: + 0x17d27a (0x5578a85a727a in /home/ethan/anaconda3/envs/torch/bin/python)
frame #16: PyObject_Call + 0x6e (0x5578a8551fde in /home/ethan/anaconda3/envs/torch/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x1e9d (0x5578a85f754d in /home/ethan/anaconda3/envs/torch/bin/python)
frame #18: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #19: _PyFunction_FastCallDict + 0x3ff (0x5578a85406ef in /home/ethan/anaconda3/envs/torch/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x1e9d (0x5578a85f754d in /home/ethan/anaconda3/envs/torch/bin/python)
frame #21: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #22: _PyFunction_FastCallDict + 0x1d5 (0x5578a85404c5 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #23: _PyObject_Call_Prepend + 0x63 (0x5578a855fa73 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #24: PyObject_Call + 0x6e (0x5578a8551fde in /home/ethan/anaconda3/envs/torch/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x1e9d (0x5578a85f754d in /home/ethan/anaconda3/envs/torch/bin/python)
frame #26: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #27: _PyFunction_FastCallKeywords + 0x387 (0x5578a85932b7 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x690 (0x5578a85f5d40 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #29: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #30: _PyFunction_FastCallKeywords + 0x387 (0x5578a85932b7 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x14d4 (0x5578a85f6b84 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #32: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #33: _PyFunction_FastCallKeywords + 0x387 (0x5578a85932b7 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x14d4 (0x5578a85f6b84 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #35: _PyFunction_FastCallKeywords + 0xfb (0x5578a859302b in /home/ethan/anaconda3/envs/torch/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x416 (0x5578a85f5ac6 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #37: _PyEval_EvalCodeWithName + 0x2f9 (0x5578a853f389 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #38: PyEval_EvalCodeEx + 0x44 (0x5578a85402b4 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #39: PyEval_EvalCode + 0x1c (0x5578a85402dc in /home/ethan/anaconda3/envs/torch/bin/python)
frame #40: + 0x22c664 (0x5578a8656664 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #41: PyRun_FileExFlags + 0xa1 (0x5578a8660a91 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #42: PyRun_SimpleFileExFlags + 0x1c3 (0x5578a8660c83 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #43: + 0x237db5 (0x5578a8661db5 in /home/ethan/anaconda3/envs/torch/bin/python)
frame #44: _Py_UnixMain + 0x3c (0x5578a8661edc in /home/ethan/anaconda3/envs/torch/bin/python)
frame #45: __libc_start_main + 0xe7 (0x7f26fee80b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #46: + 0x1db3e0 (0x5578a86053e0 in /home/ethan/anaconda3/envs/torch/bin/python)
Traceback (most recent call last):
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/home/ethan/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/home/ethan/anaconda3/envs/torch/bin/python', '-u', './tools/train_recognizer.py', '--local_rank=0', 'configs/TSN/ucf101/tsn_rgb_bninception.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
My PyTorch version is 1.1.0. Any solution?
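For anyone wondering what the flag actually does: DDP raises this error when some registered parameters never receive gradients. Here is a minimal, self-contained sketch of that situation (single process, gloo backend on CPU, just to let DDP construct; the module and its names are made up for illustration, not mmaction code):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

# Single-process process group on CPU, only so that DDP can be constructed.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29501')
dist.init_process_group('gloo', rank=0, world_size=1)

class PartiallyUsedNet(nn.Module):
    """A module whose forward() ignores one of its registered submodules."""
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.never_called = nn.Linear(4, 2)  # registered, but unused in forward

    def forward(self, x):
        return self.used(x)  # self.never_called receives no gradient

# With find_unused_parameters=False (the old default in this code path), DDP
# waits for gradients that never arrive and the next iteration fails with the
# "Expected to have finished reduction" RuntimeError. With True it trains.
model = nn.parallel.DistributedDataParallel(
    PartiallyUsedNet(), find_unused_parameters=True)

for _ in range(2):  # two iterations: the error, if any, appears on the second
    model.zero_grad()
    loss = model(torch.randn(8, 4)).sum()
    loss.backward()

dist.destroy_process_group()
```

The unused submodule's weights end up with no gradient at all, which is exactly what DDP detects and tolerates when the flag is set.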