joffery / M-ADA

The PyTorch implementation of "Learning to Learn Single Domain Generalization" (CVPR 2020)
https://arxiv.org/abs/2003.13216

RuntimeError in backward pass due to in-place operation #19

Open muhammadsmalik opened 5 months ago

muhammadsmalik commented 5 months ago

After modifying this line to include `retain_graph`: `D_loss.backward(retain_graph=True)`

I now keep getting this error:

```
python .\main_Digits.py
Pre-train wae
Loading MNIST dataset.
Loading MNIST dataset.
Warning: Error detected in AddmmBackward. Traceback of forward call that caused the error:
  File ".\main_Digits.py", line 321, in <module>
    main()
  File ".\main_Digits.py", line 83, in main
    train(model, exp_name, **kwargs)
  File ".\main_Digits.py", line 97, in train
    wae_train(wae, discriminator, train_loader, wae_optimizer, d_optimizer, epoch)
  File ".\main_Digits.py", line 290, in wae_train
    D_z_tilde = D(z_tilde.clone())
  File "C:\Users\admin\anaconda3\envs\M-ADA\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\admin\Desktop\MPhil\Domain_Generalization\models\ada_conv.py", line 66, in forward
    return self.net(z)
  File "C:\Users\admin\anaconda3\envs\M-ADA\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\admin\anaconda3\envs\M-ADA\lib\site-packages\torch\nn\modules\container.py", line 100, in forward
    input = module(input)
  File "C:\Users\admin\anaconda3\envs\M-ADA\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\admin\anaconda3\envs\M-ADA\lib\site-packages\torch\nn\modules\linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:\Users\admin\anaconda3\envs\M-ADA\lib\site-packages\torch\nn\functional.py", line 1610, in linear
    ret = torch.addmm(bias, input, weight.t())
 (print_stack at ..\torch\csrc\autograd\python_anomaly_mode.cpp:60)
Traceback (most recent call last):
  File ".\main_Digits.py", line 321, in <module>
    main()
  File ".\main_Digits.py", line 83, in main
    train(model, exp_name, **kwargs)
  File ".\main_Digits.py", line 97, in train
    wae_train(wae, discriminator, train_loader, wae_optimizer, d_optimizer, epoch)
  File ".\main_Digits.py", line 306, in wae_train
    loss.backward()  # No need to retain the graph here if this is the final use of it
  File "C:\Users\admin\anaconda3\envs\M-ADA\lib\site-packages\torch\tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\admin\anaconda3\envs\M-ADA\lib\site-packages\torch\autograd\__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 1]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
```

lisa-lthorrold commented 1 month ago

You might have to move the `d_optimizer.step()` call until after `loss.backward()` has been called in the WAE training loop, roughly as sketched below. The code was written against an older version of PyTorch where this error was not thrown, but the gradients calculated from the sequence of operations in the existing code are not correct.
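A minimal sketch of that ordering (the encoder, decoder, discriminator, and loss terms below are stand-ins I made up; only the ordering of the `backward()` and `step()` calls is the point):

```python
# Stand-in modules: only the ordering of backward() and step() reflects the
# suggestion above; everything else is a placeholder, not the repo's code.
import torch
import torch.nn as nn

enc = nn.Linear(784, 64)                  # stand-in WAE encoder
dec = nn.Linear(64, 784)                  # stand-in WAE decoder
D = nn.Linear(64, 1)                      # stand-in discriminator

wae_optimizer = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))
d_optimizer = torch.optim.Adam(D.parameters())

x = torch.randn(128, 784)                 # dummy batch
z_tilde = enc(x)                          # encoded batch
z = torch.randn_like(z_tilde)             # sample from the prior

d_optimizer.zero_grad()
D_loss = (D(z_tilde) - D(z)).mean()       # placeholder discriminator loss
D_loss.backward(retain_graph=True)
# d_optimizer.step() used to sit here; stepping now would modify D's weights
# in place while they are still needed by the graph loss.backward() will use.

wae_optimizer.zero_grad()
recon = dec(z_tilde)
loss = (recon - x).pow(2).mean() - D(z_tilde).mean()  # placeholder WAE loss
loss.backward()

d_optimizer.step()                        # safe: every backward() is done
wae_optimizer.step()
```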

Basically, the discriminator's parameters are updated in place when `d_optimizer.step()` is called. Because those parameters are still part of the graph that `loss` was built on, the subsequent `loss.backward()` call fails autograd's version check - this is the in-place update the error message is referring to. More information about the error here:
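For anyone who wants to see the mechanism in isolation, here is a made-up minimal reproduction (not code from this repo): the optimizer step mutates the weight in place, which bumps its autograd version counter and invalidates the retained graph.

```python
import torch
import torch.nn as nn

D = nn.Linear(4, 1)
d_opt = torch.optim.SGD(D.parameters(), lr=0.1)

z = torch.randn(8, 4, requires_grad=True)  # stands in for z_tilde from the encoder
out = D(z)                                 # the graph saves D.weight (via its transpose)

D_loss = out.mean()
D_loss.backward(retain_graph=True)         # first backward succeeds
d_opt.step()                               # in-place update bumps D.weight's version

loss = (out ** 2).mean()                   # reuses the retained graph through out
loss.backward()                            # RuntimeError: ... modified by an inplace operation
```

Reordering the last three lines so that both `backward()` calls run before `d_opt.step()` makes the snippet run cleanly, which is exactly the fix suggested above.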