tianzhi0549 / FCOS

FCOS: Fully Convolutional One-Stage Object Detection (ICCV'19)
https://arxiv.org/abs/1904.01355

Bug for defined but unused convolution layer #217

Closed: fanq15 closed this issue 4 years ago

fanq15 commented 4 years ago

I have run into a very weird bug. If I define a convolution layer in the FCOS head but never use it, the code gives me the error below. I have never seen this in PyTorch before. Is it because of distributed training, i.e. every defined layer must be used in the forward function? I know how to work around it, but since it is so strange I want to ask whether you have encountered it too. Thank you! (A minimal sketch of the pattern is shown after the traceback.)

Traceback (most recent call last):
  File "tools/train_net.py", line 183, in <module>
    main()
  File "tools/train_net.py", line 176, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 82, in train
    arguments,
  File "/data/qfanaa/code/fcos/yolact/fcos_core/engine/trainer.py", line 79, in do_train
    losses.backward()
  File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.5/dist-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]
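
For context, here is a minimal sketch of the pattern that triggers this; the module and layer names are hypothetical, not taken from the FCOS code. The key point is a registered layer whose parameters never receive gradients because forward() never calls it:

import torch.nn as nn

class HeadWithUnusedConv(nn.Module):
    def __init__(self):
        super().__init__()
        self.used_conv = nn.Conv2d(256, 256, 3, padding=1)
        # Registered on the module, so DDP tracks its parameters,
        # but forward() below never calls it.
        self.unused_conv = nn.Conv2d(256, 256, 3, padding=1)

    def forward(self, x):
        return self.used_conv(x)

# When this module is wrapped in torch.nn.parallel.DistributedDataParallel,
# the reducer expects a gradient for every registered parameter during
# backward(); the parameters of unused_conv never get one, which older
# PyTorch versions surface as internal errors like the _queue_reduction
# TypeError above.
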
tianzhi0549 commented 4 years ago

@fanq15 This is expected behavior. Please upgrade to the latest PyTorch and set find_unused_parameters=True for DistributedDataParallel. See the documentation here: https://pytorch.org/docs/stable/_modules/torch/nn/parallel/distributed.html.
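
For reference, a minimal sketch of the suggested fix; `model` and `local_rank` are placeholders standing in for the FCOS training script's variables:

import torch.nn as nn

# Assumes torch.distributed.init_process_group(...) has already been called
# and model / local_rank are set up as in a typical DDP training script.
model = nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    output_device=local_rank,
    # Ask DDP to detect parameters that receive no gradient in an
    # iteration instead of erroring when their reduction never fires.
    find_unused_parameters=True,
)

Note that find_unused_parameters=True makes DDP traverse the autograd graph after every forward pass to mark non-participating parameters as ready for reduction, so it adds some per-iteration overhead.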

fanq15 commented 4 years ago

Thank you very much!