TianhaoFu opened this issue 2 years ago
You can set find_unused_parameters to False.
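In case it helps, here is a minimal sketch of where that flag ends up (not the actual CenterPoint training code; wrap_model and local_rank are just illustrative names): the setting is passed straight through to torch.nn.parallel.DistributedDataParallel.

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(model: nn.Module, local_rank: int) -> DDP:
    # find_unused_parameters=False skips the extra graph traversal and is
    # faster, but it assumes every parameter receives a gradient in every
    # iteration; if some branch of the model is skipped, the reducer raises
    # the "Expected to have finished reduction" error shown below.
    return DDP(
        model.cuda(local_rank),
        device_ids=[local_rank],
        output_device=local_rank,
        find_unused_parameters=False,
    )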
When I set it to False, I came across this error:
Traceback (most recent call last):
File "./tools/train.py", line 181, in <module>
main()
File "./tools/train.py", line 176, in main
logger=logger,
File "/centerpoint/centerpoint/det3d/torchie/apis/train.py", line 326, in train_detector
trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
File "/centerpoint/centerpoint/det3d/torchie/trainer/trainer.py", line 545, in run
epoch_runner(data_loaders[i], self.epoch, **kwargs)
File "/centerpoint/centerpoint//det3d/torchie/trainer/trainer.py", line 412, in train
self.model, data_batch, train_mode=True, **kwargs
File "/centerpoint/centerpoint/det3d/torchie/trainer/trainer.py", line 367, in batch_processor_inline
losses = model(example, return_loss=True)
File "/home/pai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 903, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/pai/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 714, in forward
if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
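For what it's worth, the situation this message describes can be reproduced with a toy module (illustrative only, not CenterPoint code; the names Toy, backbone, and extra_head are made up): a submodule whose output never feeds into the loss leaves its parameters without gradients, which is exactly what trips the DDP reducer.

import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)
        self.extra_head = nn.Linear(4, 4)  # never used in forward below

    def forward(self, x):
        # extra_head is skipped, so its parameters get no gradient; under DDP
        # with find_unused_parameters=False this fails with the error above.
        return self.backbone(x)

toy = Toy()
loss = toy(torch.randn(2, 4)).sum()
loss.backward()
print(toy.extra_head.weight.grad)  # None: this parameter did not take part in the loss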
@tianweiy
Torch version?
I can't find this line in my torch:
File "/home/pai/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 735, in forward assert(self.find_unused_parameters == False)
My torch version is 1.8 and my CUDA version is 11.0.
Thanks :)
I have multiple PhD interviews this week and won't be able to check until next Monday. Maybe you can debug it a bit first.
Maybe you need to use the slurm runner? Multi-machine training is a bit complicated, and the original command may not work. Sorry, I don't have any experience with this. I can ask around if you haven't fixed the issue.
Hi, when I was using torch DDP on 8 A10 GPU machines (each machine has one GPU), I came across this error:
My running command is:
I have not changed any code, only the running command.
Why does this bug happen? Could you tell me which module of the model's output does not participate in calculating the loss, so I can check the cause of the error? Thanks!
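One way to find out yourself (a minimal sketch; report_unused_parameters is just a hypothetical helper, and it assumes you can run a single iteration without DDP, or with find_unused_parameters=True, and call it right after the backward pass): any parameter whose .grad is still None afterwards belongs to a module whose output did not participate in the loss.

def report_unused_parameters(model):
    # Call this right after loss.backward(); parameters that still have no
    # gradient never contributed to the loss in that iteration.
    for name, param in model.named_parameters():
        if param.requires_grad and param.grad is None:
            print("no gradient:", name)

Under DDP the underlying network is model.module, so pass that in rather than the wrapper.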