pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation #92807

Closed de-gozaru closed 1 year ago

de-gozaru commented 1 year ago

🐛 Describe the bug

Hi,

I want to use `torch.linalg.svd` inside my network.

The structure of my code is the following:

```python
torch.autograd.set_detect_anomaly(True)
[...]

def forward(self, data):
  input = data.x
  # input.is_leaf == True
  output = self.network(input)
  # output.is_leaf == False
  covariance = self.get_covariance(output)
  U, _, VT = torch.linalg.svd(covariance)
  # raises RuntimeError: one of the variables needed for gradient computation ...

def get_covariance(self, output):
  # create a sparse matrix using `output`, and convert it back to dense later
  # sparse matrix created using: sparse_matrix = torch.sparse.FloatTensor(i, v, torch.Size(shape))
  return covariance
```
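For reference, the sparse construction step mentioned in the comment above can be sketched like this. The indices, values, and shape are made up for illustration, and `torch.sparse.FloatTensor` is the legacy spelling of today's `torch.sparse_coo_tensor`:

```python
import torch

# Hypothetical COO data: 2 x nnz index matrix and matching values.
i = torch.tensor([[0, 1, 1],
                  [2, 0, 2]])
v = torch.tensor([3.0, 4.0, 5.0])
shape = (2, 3)

# torch.sparse_coo_tensor is the current equivalent of the legacy
# torch.sparse.FloatTensor(i, v, torch.Size(shape)) constructor.
sparse_matrix = torch.sparse_coo_tensor(i, v, shape)

# Convert back to a dense tensor, as the comment describes.
covariance = sparse_matrix.to_dense()
print(covariance)
```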

and I get the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [5000, 3, 3]], which is output 0 of LinalgSvdBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I checked inside my self.network and I could not find any in-place operation.

Can you please give me some directions for debugging this? Is it related to the sparse matrix creation? And why does `output` become a non-leaf tensor even though my network is just an addition operation?
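A minimal sketch of how this class of error typically arises (the shapes and the in-place write here are illustrative, not taken from the actual network): an output of `torch.linalg.svd` is overwritten in place, so the backward pass finds the saved tensor at a newer version than expected.

```python
import torch

x = torch.randn(3, 3, requires_grad=True)
U, S, Vh = torch.linalg.svd(x, full_matrices=False)

# In-place write to a tensor that SVD's backward has saved; this bumps
# U's version counter from 0 to 1.
U[:, 0] = 0.0

err = None
try:
    U.sum().backward()
except RuntimeError as e:
    err = str(e)
print(err)  # "... has been modified by an inplace operation ..."
```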

Versions

Collecting environment information...
PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.20.0
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-46-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 2080 Ti
Nvidia driver version: 515.65.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.8
[pip3] mypy==0.971
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.0
[pip3] pytorch-metric-learning==1.5.2
[pip3] torch==1.12.1+cu116
[pip3] torch-cluster==1.6.0
[pip3] torch-geometric==2.0.4
[pip3] torch-scatter==2.0.9
[pip3] torch-sparse==0.6.15
[pip3] torch-spline-conv==1.2.1
[pip3] torchaudio==0.12.1+cu116
[pip3] torchdrift==0.1.0.post1
[pip3] torchfile==0.1.0
[pip3] torchmetrics==0.9.3
[pip3] torchnet==0.0.4
[pip3] torchvision==0.13.1+cu116

cc @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @Lezcano

lezcano commented 1 year ago

Could you try to provide a (small if possible) self-contained repro?

de-gozaru commented 1 year ago

Hi,

I found the source of the error. It was something like `matrix[:, :10] = matrix[:, :10].copy() * 10`; I believe this slice assignment counts as an in-place operation.
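A hedged sketch of an out-of-place alternative to that kind of slice assignment (the names, shapes, and the factor of 10 are illustrative, not from the actual code): instead of writing into a tensor that autograd may have saved for backward, build a fresh tensor.

```python
import torch

matrix = torch.randn(4, 20, requires_grad=True)
out = matrix * 2  # stand-in for some upstream computation

# Problematic pattern (commented out): in-place slice assignment bumps
# `out`'s version counter and breaks backward if `out` was saved.
# out[:, :10] = out[:, :10] * 10

# Out-of-place alternative: concatenate the scaled and unscaled parts
# into a new tensor, leaving `out` untouched.
scaled = torch.cat([out[:, :10] * 10, out[:, 10:]], dim=1)

scaled.sum().backward()
print(matrix.grad)  # gradients flow without a RuntimeError
```

Another common fix is to clone first (`fixed = out.clone()`) and do the slice assignment on the clone, which leaves the saved original at its expected version.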