[Open] hust-lidelong opened this issue 2 years ago
As suggested in the MMDet documentation (doc), you can try adding the following to your config file:
find_unused_parameters = True
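In MMDet-style configs this is a single top-level assignment. A minimal sketch (the base config path is a hypothetical example, not from this thread):

```python
# my_s2anet_config.py -- hypothetical config name
_base_ = './s2anet_r50_fpn_1x_dota10.py'  # assumption: your actual base config

# Tell DistributedDataParallel to scan the autograd graph each iteration
# for parameters that received no gradient, instead of raising the
# "Expected to have finished reduction" RuntimeError. This adds a small
# per-step overhead.
find_unused_parameters = True
```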
Thanks for your reply. I tried adding find_unused_parameters = True, but training becomes slower (the MMDet docs also note: "but this will slow down the training speed"). Is there another way?
@hust-lidelong This happens because or_pooling defines a set of learnable parameters that are never used. I commented those parameters out in the latest commit, so distributed training should now work directly.
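The failure mode described above can be reproduced and diagnosed without DDP at all: any parameter that never enters the forward graph keeps `grad = None` after `backward()`, and those are exactly the parameters DDP's reducer complains about. A toy sketch (the module below is a hypothetical simplification, not OBBDetection's actual or_pooling):

```python
import torch
import torch.nn as nn

class ORPoolingLike(nn.Module):
    """Toy module mimicking the reported bug: a learnable parameter is
    registered but never used in forward()."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)
        # Registered but never touched in forward(); under DDP this is
        # what triggers the "finished reduction" RuntimeError.
        self.unused = nn.Parameter(torch.zeros(3))

    def forward(self, x):
        return self.linear(x)

model = ORPoolingLike()
loss = model(torch.randn(8, 4)).sum()
loss.backward()

# Parameters whose .grad is still None after backward never joined the
# autograd graph -- these are the ones DDP cannot reduce.
dead = [name for name, p in model.named_parameters() if p.grad is None]
print(dead)  # → ['unused']
```

Running a check like this on a single GPU is a quick way to find which parameters to remove (or to confirm that find_unused_parameters is genuinely needed, e.g. for conditionally executed branches).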
Distributed training works now. Thumbs up!
Hi, when I run S2ANet, single-GPU training works fine, but multi-GPU training fails with:
Traceback (most recent call last):
  File "/home/lidelong/data/code/OBBDetection/./tools/train.py", line 162, in <module>
    main()
  File "/home/lidelong/data/code/OBBDetection/./tools/train.py", line 151, in main
    train_detector(
  File "/home/lidelong/data/code/OBBDetection/mmdet/apis/train.py", line 136, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/lidelong/miniconda3/envs/obbdet/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/lidelong/miniconda3/envs/obbdet/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/lidelong/miniconda3/envs/obbdet/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/home/lidelong/miniconda3/envs/obbdet/lib/python3.9/site-packages/mmcv/parallel/distributed.py", line 42, in train_step
    and self.reducer._rebuild_buckets()):
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
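Remedy (1) from the error message can be sketched in isolation as follows. This is a single-process gloo group on CPU, purely illustrative; in OBBDetection/mmcv the DDP wrapping happens inside the training API, so in practice you set find_unused_parameters in the config rather than constructing DDP yourself:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

# Minimal single-process "distributed" setup (assumption: CPU, gloo
# backend) just to show where find_unused_parameters is passed.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group('gloo', rank=0, world_size=1)

model = nn.Linear(4, 2)
ddp = nn.parallel.DistributedDataParallel(
    model, find_unused_parameters=True)

# With find_unused_parameters=True, the reducer tolerates parameters
# that receive no gradient in a given iteration.
ddp(torch.randn(8, 4)).sum().backward()

dist.destroy_process_group()
```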