microsoft / Cream

This is a collection of our NAS and Vision Transformer work.

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: #5

Closed Ema1997 closed 4 years ago

Ema1997 commented 4 years ago

I encountered a runtime error when I tried to search for an architecture based on your code.

```
/opt/conda/conda-bld/pytorch_1565272279342/work/torch/csrc/autograd/python_anomaly_mode.cpp:57: UserWarning: Traceback of forward call that caused the error:
  File "tools/train.py", line 300, in <module>
    main()
  File "tools/train.py", line 259, in main
    est=model_est, local_rank=args.local_rank)
  File "/opt/tiger/cream/lib/core/train.py", line 55, in train_epoch
    output = model(input, random_cand)
  File "/home/tiger/.conda/envs/Cream/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tiger/.conda/envs/Cream/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/tiger/.conda/envs/Cream/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/cream/lib/models/structures/supernet.py", line 121, in forward
    x = self.forward_features(x, architecture)
  File "/opt/tiger/cream/lib/models/structures/supernet.py", line 113, in forward_features
    x = blocks[arch](x)
  File "/home/tiger/.conda/envs/Cream/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tiger/.conda/envs/Cream/lib/python3.6/site-packages/timm/models/efficientnet_blocks.py", line 133, in forward
    x = self.bn1(x)
  File "/home/tiger/.conda/envs/Cream/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tiger/.conda/envs/Cream/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 81, in forward
    exponential_average_factor, self.eps)
  File "/home/tiger/.conda/envs/Cream/lib/python3.6/site-packages/torch/nn/functional.py", line 1656, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
```

```
Traceback (most recent call last):
  File "tools/train.py", line 300, in <module>
    main()
  File "tools/train.py", line 259, in main
    est=model_est, local_rank=args.local_rank)
  File "/opt/tiger/cream/lib/core/train.py", line 67, in train_epoch
    loss.backward()
  File "/home/tiger/.conda/envs/Cream/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/tiger/.conda/envs/Cream/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [320]] is at version 2507; expected version 2506 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
```
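For anyone reproducing this: the "Traceback of forward call" block above comes from PyTorch's anomaly-detection mode, which records the forward-pass location of the op whose gradient later fails. A minimal sketch of enabling it:

```python
import torch

# With anomaly detection on, autograd stores the forward traceback for
# every op and replays it in the error message when backward() fails.
# It slows training noticeably, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)
```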

I tried to locate the source of the error, and I found that whenever the code updates the meta network or adds the kd_loss to the final loss, the error above appears. How can I fix this problem?

Z7zuqer commented 4 years ago

Hi,

Thanks for your kind question!

Could you share your environment with us? We haven't encountered such problems using this code base.

We would really appreciate your reply.

Best, Hao.

Ema1997 commented 4 years ago

```
Package                Version
---------------------- ---------------------
future                 0.18.2
numpy                  1.17.0
opencv-python          4.0.1.24
Pillow                 6.1.0
ptflops                0.6.2
tensorboard            2.3.0
tensorboard-plugin-wit 1.7.0
tensorboardX           1.2
thop                   0.0.31.post2005241907
timm                   0.1.20
torch                  1.2.0
torchvision            0.2.1
yacs                   0.1.8
```

Thank you very much.

macn3388 commented 4 years ago

same problem.

Z7zuqer commented 4 years ago

Hi,

We have carefully checked the source code and environments; this bug comes from torch's DistributedDataParallel. We previously thought apex was not required. However, due to the implementation of torch DDP, we could not train the supernet under the SPOS mechanism with it.

Thus, to solve this bug, you need the apex package: please install apex before supernet training. We will fix the installation steps in README.md.
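For context, a minimal sketch of what relying on apex amounts to, assuming apex is installed with its C++/CUDA extensions; `build_supernet` is a placeholder, not the actual Cream construction code:

```python
import torch
# apex's DDP performs the gradient all-reduce differently from
# torch.nn.parallel.DistributedDataParallel, which is what trips the
# in-place-modification error under single-path (SPOS-style) training.
from apex.parallel import DistributedDataParallel as ApexDDP

def wrap_for_distributed(model: torch.nn.Module) -> torch.nn.Module:
    # delay_allreduce=True defers the all-reduce until the whole
    # backward pass has finished, so gradients are not touched mid-graph.
    return ApexDDP(model, delay_allreduce=True)

# model = wrap_for_distributed(build_supernet())  # hypothetical usage
```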

Thanks. Hao.

penghouwen commented 4 years ago

@Ema1997 @macn3388 Would you check whether the issue has been solved? Thanks.

cswaynecool commented 3 years ago

The same error occurs even when using apex.

cswaynecool commented 3 years ago

Adding "for name, param in model.named_parameters(recurse=True): param.grad = None" at the beginning of update_student_weights_only solves my problem. It is caused by optimizer.step(), which changes the parameters of meta network.

penghouwen commented 3 years ago

Adding "for name, param in model.named_parameters(recurse=True): param.grad = None" at the beginning of update_student_weights_only solves my problem. It is caused by optimizer.step(), which changes the parameters of meta network.

In our experience, if the installation strictly follows the README, this issue should not occur.

Z7zuqer commented 3 years ago

Adding "for name, param in model.named_parameters(recurse=True): param.grad = None" at the beginning of update_student_weights_only solves my problem. It is caused by optimizer.step(), which changes the parameters of meta network.

Hi,

Could you share your environment variables with us?

We have tested the code. When using apex (installed following the README), the error should not occur.

Best, Hao.

jonsnows commented 3 years ago

Adding "for name, param in model.named_parameters(recurse=True): param.grad = None" at the beginning of update_student_weights_only solves my problem. It is caused by optimizer.step(), which changes the parameters of meta network.

Hello, I want to ask: where did you add the code? I ran into the same problem after installing apex with pip.

Z7zuqer commented 3 years ago

Adding "for name, param in model.named_parameters(recurse=True): param.grad = None" at the beginning of update_student_weights_only solves my problem. It is caused by optimizer.step(), which changes the parameters of meta network.

hello i want to ask where you add the code? i ocuur the same problem after i have installed apex using pip.

Hi,

You should install apex with the C++ and CUDA extensions, as indicated in this URL:

```
python ./apex/setup.py install --cpp_ext --cuda_ext
```

Alternatively, you could add the code above, as SPOS does: set the grads to None at the start of each training iteration.
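Here is a minimal sketch of that per-iteration variant; `model`, `optimizer`, `criterion`, and `loader` stand in for the objects in your own training script:

```python
# Hypothetical training loop: drop stale grads at the top of every
# iteration (the SPOS approach) instead of calling optimizer.zero_grad().
for inputs, targets in loader:
    for param in model.parameters():
        param.grad = None  # detach the old grad tensors entirely
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```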

Best, Hao.