jiawei-ren / BalancedMetaSoftmax-Classification

[NeurIPS 2020] Balanced Meta-Softmax for Long-Tailed Visual Recognition
https://github.com/jiawei-ren/BalancedMetaSoftmax

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation #2

Closed: zyongbei closed this issue 3 years ago

zyongbei commented 3 years ago

https://github.com/jiawei-ren/BalancedMetaSoftmax-Classification/blob/c41238d1fa5fd5a27cf2b49d5398d8906bed7813/run_networks.py#L284

Configuration file: ./config/CIFAR10_LT/balms_imba200.yaml
PyTorch version: 1.7.0, Higher version: 0.2.1

Hello, this error occurred the second time the meta_forward() function was run. I carefully reviewed the code, but I cannot figure out where the problem is.
Thank you!

Traceback (most recent call last):
  File "main.py", line 161, in <module>
    training_model.train()
  File "/data/byz/jupyter/BalancedMetaSoftmax-Classification/run_networks.py", line 360, in train
    self.meta_forward(inputs, labels, verbose=step % self.training_opt['display_step'] == 0)
  File "/data/byz/jupyter/BalancedMetaSoftmax-Classification/run_networks.py", line 288, in meta_forward
    val_loss.backward()
  File "/home/wangshuogroup/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/wangshuogroup/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [10, 1]], which is output 0 of TBackward, is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
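For reference, a minimal standalone sketch (not taken from this repository) that reproduces the same class of error: a tensor that autograd saved for the backward pass is modified in place before backward() is called, so its version counter no longer matches the saved version.

```python
import torch

a = torch.ones(3, requires_grad=True)
b = a * 2            # b is at version 0 after this op
c = (b ** 2).sum()   # pow saves b's current value for its backward pass
b += 1               # in-place add bumps b's version counter to 1
c.backward()         # RuntimeError: ... modified by an inplace operation
```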

cunjunyu commented 3 years ago

Would you mind providing:

  1. Which dataset you were training on and the configuration file you were using?
  2. The versions of PyTorch and Higher you were using

So we can have a look at the problem.

Thank you.

zyongbei commented 3 years ago


Hello,
Configuration file: ./config/CIFAR10_LT/balms_imba200.yaml
PyTorch version: 1.7.0, Higher version: 0.2.1
Thank you~

jiawei-ren commented 3 years ago

Please downgrade your PyTorch to 1.4 for now, which is the version we used for testing. We will follow up on this issue.

jiawei-ren commented 3 years ago

Setting num_workers=0 in the config will solve the issue for PyTorch >= 1.5.

The sample rates used in the backward pass are a few iterations older than the current learner parameters, because the data loader buffers some batches in sub-processes to speed up training. PyTorch >= 1.5 reports this mismatch as a runtime error. Moving data loading to the main process (num_workers=0) resolves the error but also slows down training. Since the mismatch does not make a visible difference to the training outcome, staying on PyTorch < 1.5 is the better option.
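For illustration, a minimal sketch of what the num_workers=0 workaround amounts to, assuming a standard torch.utils.data.DataLoader; the dataset and batch size below are placeholders rather than the repository's actual setup, where these values come from the YAML config.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; in the actual code the dataset and num_workers
# are taken from the YAML configuration file.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32),
                        torch.randint(0, 10, (1024,)))

# num_workers=0 keeps data loading in the main process, so each batch
# (and the sample rates computed for it) is produced with the current
# learner parameters rather than a few iterations earlier.
loader = DataLoader(dataset, batch_size=128, shuffle=True, num_workers=0)
```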

cunjunyu commented 3 years ago

I will close the issue for now; feel free to reopen it if you have further questions.