saic-violet / bilayer-model

Mozilla Public License 2.0
'DistributedDataParallel' object has no attribute 'callback_queued' #8

Closed daili0015 closed 4 years ago

daili0015 commented 4 years ago

Traceback (most recent call last):
  File "train.py", line 422, in <module>
    nets = m.train(args)
  File "train.py", line 337, in train
    loss = model(data_dict)
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/parallel/distributed.py", line 560, in forward
    result = self.module(*inputs, **kwargs)
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/lustre/zhengchengyao/workspace/bilayer-model-master/runners/default.py", line 195, in forward
    self.data_dict = self.nets[net_name](self.data_dict, networks_to_train, self.nets)
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/lustre/zhengchengyao/workspace/bilayer-model-master/networks/texture_enhancer.py", line 149, in forward
    loss_enh.backward()
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/parallel/distributed.py", line 392, in allreduce_hook
    if not self.callback_queued:
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 585, in __getattr__
    type(self).__name__, name))
AttributeError: 'DistributedDataParallel' object has no attribute 'callback_queued'

I hit this error when training the enhancer. It may be caused by gradient operations; see https://github.com/NVIDIA/apex/issues/107. Have you ever run into this problem?

daili0015 commented 4 years ago

Changing the line at https://github.com/saic-violet/bilayer-model/blob/master/networks/texture_enhancer.py#L147 to

loss_enh = self.rgb_loss(pred_source_imgs, source_imgs.detach())

solves the problem.

source_imgs is used as the ground truth, so I think it should not be part of the computation graph.
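
For readers unfamiliar with detach(), here is a minimal sketch of what the change does. It is not the repository's code: the tensor shapes are made up, F.l1_loss is only a stand-in for self.rgb_loss, and source_leaf is a hypothetical upstream tensor added so that the target has gradient history to cut off. It does not reproduce the Apex error itself.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for the tensors named in the thread.
pred_source_imgs = torch.rand(2, 3, 8, 8, requires_grad=True)

# Assumption: the target was produced by an op that tracks gradients.
source_leaf = torch.rand(2, 3, 8, 8, requires_grad=True)
source_imgs = source_leaf * 1.0

# detach() returns a tensor sharing the same data but cut off from the graph.
target = source_imgs.detach()

# F.l1_loss stands in for self.rgb_loss, whose definition is not shown here.
loss_enh = F.l1_loss(pred_source_imgs, target)
loss_enh.backward()

target_requires_grad = target.requires_grad
pred_has_grad = pred_source_imgs.grad is not None
source_has_grad = source_leaf.grad is not None
```

With the target detached, backward() only populates the gradient of the prediction; nothing upstream of the ground-truth image receives one.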

egorzakharov commented 4 years ago

Done, thanks

No, I have not encountered this problem. I believe the same code worked for me even with the latest version of Apex, but I have committed your proposed change. If I understand correctly, it should not have made a difference, since source_imgs already has requires_grad=False, but maybe I'm mistaken.
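
The point about requires_grad=False can be checked in isolation with a small sketch. It assumes plain PyTorch without Apex or DistributedDataParallel, uses made-up shapes and F.l1_loss in place of self.rgb_loss, and introduces a hypothetical helper grad_of_pred. It says nothing about whether source_imgs really had requires_grad=False in the failing run.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# A tensor created this way has requires_grad=False by default.
source_imgs = torch.rand(2, 3, 8, 8)
pred = torch.rand(2, 3, 8, 8, requires_grad=True)


def grad_of_pred(target):
    """Hypothetical helper: gradient of the loss with respect to pred."""
    loss = F.l1_loss(pred, target)
    (grad,) = torch.autograd.grad(loss, pred)
    return grad


grad_plain = grad_of_pred(source_imgs)
grad_detached = grad_of_pred(source_imgs.detach())

# Compare the two gradients and the requires_grad flags.
same_grad = torch.equal(grad_plain, grad_detached)
flags = (source_imgs.requires_grad, source_imgs.detach().requires_grad)
```

In this setting the two gradients are identical, which is consistent with the expectation that detach() is a no-op for a tensor that already has no gradient history. If the error really did go away after the change, source_imgs may have carried gradient history in that particular run.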