daili0015 closed this issue 4 years ago
Changing this line: https://github.com/saic-violet/bilayer-model/blob/master/networks/texture_enhancer.py#L147
to
`loss_enh = self.rgb_loss(pred_source_imgs, source_imgs.detach())`
solves the problem.
`source_imgs` is used as the ground truth, so I think it should not be part of the computation graph.
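A minimal sketch of the idea (the `rgb_loss` here is assumed to be an L1-style reconstruction loss; names are illustrative, not the repo's exact API): detaching the target tensor means autograd only tracks gradients through the prediction branch, never back into whatever produced the ground truth.

```python
import torch


def rgb_loss_with_detached_target(pred, target):
    # Detach the ground-truth tensor so the backward pass only
    # propagates gradients through the prediction branch.
    return torch.nn.functional.l1_loss(pred, target.detach())


pred = torch.zeros(2, 3, requires_grad=True)
# Pretend the target came from an upstream module and still carries grad info.
target = torch.ones(2, 3, requires_grad=True)

loss = rgb_loss_with_detached_target(pred, target)
loss.backward()

assert pred.grad is not None    # gradient reaches the prediction
assert target.grad is None      # no gradient flows into the ground truth
```

Without the `.detach()`, `backward()` would also traverse the graph that produced `target`, which can trigger unexpected hooks under Apex's `DistributedDataParallel`.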
Done, thanks
No, I have not encountered this problem. I believe the same code worked for me even with the latest version of Apex, but I committed your proposed changes. If I understand correctly, it should not have made a difference, since `source_imgs` already has `requires_grad=False`, but maybe I'm mistaken.
```
Traceback (most recent call last):
  File "train.py", line 422, in <module>
    nets = m.train(args)
  File "train.py", line 337, in train
    loss = model(data_dict)
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/parallel/distributed.py", line 560, in forward
    result = self.module(*inputs, **kwargs)
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/lustre/zhengchengyao/workspace/bilayer-model-master/runners/default.py", line 195, in forward
    self.data_dict = self.nets[net_name](self.data_dict, networks_to_train, self.nets)
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/lustre/zhengchengyao/workspace/bilayer-model-master/networks/texture_enhancer.py", line 149, in forward
    loss_enh.backward()
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/parallel/distributed.py", line 392, in allreduce_hook
    if not self.callback_queued:
  File "/mnt/lustre/zhengchengyao/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 585, in __getattr__
    type(self).__name__, name))
AttributeError: 'DistributedDataParallel' object has no attribute 'callback_queued'
```
When training the enhancer, I ran into this. I found it may be caused by gradient operations: https://github.com/NVIDIA/apex/issues/107 Have you ever encountered this problem?