RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

maosuli commented 2 years ago

Hello. Thanks for the excellent work.

I found a problem when I trained the model on the S3DIS dataset with pytorch 1.9.1 + cuda 11.1.

An error occurred during the backward pass. It seems one of the variables was modified before the gradient calculation.

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [12806, 32, 32]], which is output 0 of AsStridedBackward0, is at version 1; expected version 0 instead.
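For readers unfamiliar with this message, here is a minimal sketch that reproduces the same class of error outside of SPG. It assumes only stock PyTorch; the function name `run` is hypothetical and the code is not taken from this repository. `exp` keeps its output for the backward pass, so editing that output in place bumps its version counter and autograd refuses to use it.

```python
import torch

def run(inplace: bool) -> str:
    """Return "ok" if backward succeeds, else the error text."""
    x = torch.ones(3, requires_grad=True)
    y = x.exp()              # exp saves its output y for the backward pass
    if inplace:
        y.add_(1.0)          # in-place edit: y goes from version 0 to version 1
    else:
        y = y + 1.0          # out-of-place: the saved tensor is left untouched
    try:
        y.sum().backward()
        return "ok"
    except RuntimeError as err:
        return str(err)

print(run(inplace=False))
print(run(inplace=True))
```

The in-place variant should fail with a message of the same form as the one above ("is at version 1; expected version 0 instead"), while the out-of-place variant runs cleanly.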

The same code runs in an environment with pytorch 1.3 and cuda 10.1 without this error.

But maybe that is only because the in-place check in the older version was incomplete? If so, the in-place operation may have been silently producing wrong gradients there.
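The concern about silently wrong gradients can be illustrated with a small sketch. It does not show what PyTorch 1.3 actually did with the SPG code, which is not established in this thread. It only shows, under the assumption of stock PyTorch, what happens when an in-place edit escapes the version check (here by going through `.data`, which autograd does not track). The function names are hypothetical.

```python
import torch

def reference_grad() -> torch.Tensor:
    """Gradient of sum(sigmoid(a)) with nothing modified."""
    a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    a.sigmoid().sum().backward()
    return a.grad

def grad_after_untracked_edit() -> torch.Tensor:
    """Same computation, but the saved output is zeroed behind autograd's back."""
    a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    out = a.sigmoid()        # sigmoid saves its output for the backward pass
    out.data.zero_()         # edit through .data: no version bump, no error
    out.sum().backward()     # backward uses the zeroed values
    return a.grad

print(reference_grad())
print(grad_after_untracked_edit())
```

No error is raised in the second case, yet the gradient comes out as all zeros instead of the true nonzero values. A missing check would therefore hide the problem, not remove it.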

Detailed error info is as follows.

[W python_anomaly_mode.cpp:104] Warning: Error detected in GraphConvFunctionBackward. Traceback of forward call that caused the error:
  File "/model/SuperPointGraph1/superpoint_graph-ssp-spg/learning/main.py", line 461, in <module>
    main()
  File "/model/SuperPointGraph1/superpoint_graph-ssp-spg/learning/main.py", line 331, in main
    acc, loss, oacc, avg_iou = train()
  File "/model/SuperPointGraph1/superpoint_graph-ssp-spg/learning/main.py", line 205, in train
    outputs = model.ecc(embeddings)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/model/SuperPointGraph1/superpoint_graph-ssp-spg/learning/../learning/graphnet.py", line 97, in forward
    input = module(input)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/model/SuperPointGraph1/superpoint_graph-ssp-spg/learning/../learning/modules.py", line 176, in forward
    self._edge_mem_limit)
 (function _print_stack)
  0%| | 0/78 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "/model/SuperPointGraph1/superpoint_graph-ssp-spg/learning/main.py", line 461, in <module>
    main()
  File "/model/SuperPointGraph1/superpoint_graph-ssp-spg/learning/main.py", line 331, in main
    acc, loss, oacc, avg_iou = train()
  File "/model/SuperPointGraph1/superpoint_graph-ssp-spg/learning/main.py", line 209, in train
    loss.backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/model/SuperPointGraph1/superpoint_graph-ssp-spg/learning/../learning/ecc/GraphConvModule.py", line 98, in backward
    input, weights = ctx.saved_tensors
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [12806, 32, 32]], which is output 0 of AsStridedBackward0, is at version 1; expected version 0 instead.
Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later.
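The traceback shows the check firing at `ctx.saved_tensors` inside a custom autograd function. The sketch below mimics that situation with a hypothetical `WeightedProduct` function. It is not the SPG `GraphConvFunction`, and the actual fix for this repository is not confirmed in this thread. It shows one common workaround under those assumptions: clone the tensor before editing it in place, so the copy saved for backward stays at version 0.

```python
import torch

class WeightedProduct(torch.autograd.Function):
    """Hypothetical stand-in for a custom op that saves its inputs."""

    @staticmethod
    def forward(ctx, inp, weights):
        ctx.save_for_backward(inp, weights)
        return inp * weights

    @staticmethod
    def backward(ctx, grad_out):
        inp, weights = ctx.saved_tensors   # the version check happens here
        return grad_out * weights, grad_out * inp

def train_step(clone_before_inplace: bool) -> str:
    """Return "ok" if backward succeeds, else the error text."""
    inp = torch.ones(4, requires_grad=True)
    # Non-leaf tensor, so in-place edits on it are permitted by autograd.
    weights = torch.full((4,), 2.0, requires_grad=True) * 1.0
    out = WeightedProduct.apply(inp, weights)
    # Later code reuses the weights buffer and edits it in place.
    buf = weights.clone() if clone_before_inplace else weights
    buf.mul_(0.5)
    try:
        (out.sum() + buf.sum()).backward()
        return "ok"
    except RuntimeError as err:
        return str(err)

print(train_step(clone_before_inplace=False))
print(train_step(clone_before_inplace=True))
```

Without the clone, the call should fail with the same kind of message as in the traceback. With the clone it should run, at the cost of one extra copy of the tensor. Running with `torch.autograd.set_detect_anomaly(True)`, as the `python_anomaly_mode.cpp` warning above suggests was done, helps locate the forward call whose saved tensor was modified.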

I would appreciate any suggestions or help.

Thanks,

Eric.

gihunsong commented 1 year ago

Same problem as @LuZaiJiaoXiaL here. @LuZaiJiaoXiaL Did you solve the problem, by the way?

loicland commented 1 year ago

What are your pytorch / cuda versions?

gihunsong commented 1 year ago

@loicland I recently tried torch 1.12.1 and 1.13.1 with cu113 and cu117. I may need to change the version.

jing-zhao9 commented 1 year ago

@gihunsong I have the same problem. Did you solve it, by the way?

loicland commented 1 year ago

Hi!

We are releasing a new version of SuperPoint Graph called SuperPoint Transformer (SPT). It is better in every way:

✨ SPT in numbers ✨
📊 SOTA results: 76.0 mIoU S3DIS 6-Fold, 63.5 mIoU on KITTI-360 Val, 79.6 mIoU on DALES
🦋 212k parameters only!
⚡ Trains on S3DIS in 3h on 1 GPU
⚡ Preprocessing is 7x faster than SPG!
🚀 Easy install (no more boost!)

If you are interested in lightweight, high-performance 3D deep learning, you should check it out. In the meantime, we will finally retire SPG and stop maintaining this repo.

loicland / superpoint_graph

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation #270