voldemortX / pytorch-auto-drive

PytorchAutoDrive: Segmentation models (ERFNet, ENet, DeepLab, FCN...) and Lane detection models (SCNN, RESA, LSTR, LaneATT, BézierLaneNet...) based on PyTorch with fast training, visualization, benchmarking & deployment help
BSD 3-Clause "New" or "Revised" License

inplace operation error #81

Closed solidexu closed 1 year ago

solidexu commented 2 years ago

When I tested RESA with resnet50, an error occurred. Then I tested SCNN with resnet50 and hit the same issue.

python main_landet.py --train --config=./configs/lane_detection/resa/resnet50_culane.py --mixed-precision
Loaded torchvision ImageNet pre-trained weights V1.
Not using distributed mode
cuda
Traceback (most recent call last):
  File "main_landet.py", line 65, in <module>
    runner.run()
  File "/home/aaa/pytorch-auto-drive-master/utils/runners/lane_det_trainer.py", line 55, in run
    scaler.scale(loss).backward()
  File "/home/aaa/.local/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/aaa/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [5, 128, 36, 100]], which is output 0 of ReluBackward0, is at version 20; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
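For readers hitting the same message, here is a minimal sketch of this class of error. It is not taken from this repository. It assumes only that a ReLU output is edited in place before backward runs, which is what "output 0 of ReluBackward0 ... expected version 0" suggests. The function name `relu_then_inplace` is made up for illustration. Running the forward and backward passes under `torch.autograd.detect_anomaly()` additionally prints the forward-pass traceback of the failing operation, as the hint in the error recommends.

```python
import torch


def relu_then_inplace(detect_anomaly=False):
    """Return the error text raised when a ReLU output is edited in place.

    Hypothetical helper for illustration; not part of pytorch-auto-drive.
    """
    x = torch.randn(4, requires_grad=True)
    try:
        if detect_anomaly:
            # Anomaly mode must wrap the forward pass too, so autograd can
            # record where the failing operation was created.
            with torch.autograd.detect_anomaly():
                y = torch.relu(x)
                y.add_(1.0)
                y.sum().backward()
        else:
            y = torch.relu(x)   # ReLU's backward reuses its own output
            y.add_(1.0)         # in-place edit bumps the version counter
            y.sum().backward()  # autograd notices the saved output changed
    except RuntimeError as err:
        return str(err)
    return None


message = relu_then_inplace()
print(message)
```

In a real training script the same check can be switched on globally with `torch.autograd.set_detect_anomaly(True)` before the first forward pass. It slows training down, so it is best kept for debugging runs.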

voldemortX commented 2 years ago

@solidexu I don't have a spare GPU right now. I will try to test it tomorrow.

solidexu commented 2 years ago

I don't know why, but the issue goes away when I comment out the relu in RESAReducer. It's too STRANGE for me.
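One plausible reading of this observation, not confirmed anywhere in the thread: ReLU's backward pass reads the ReLU output, whereas batch norm and convolution compute their gradients from their inputs. An in-place update applied later to the feature map would therefore invalidate a ReLU output but not a BN output. The sketch below uses a made-up `ToyReducer` (conv, BN, optional ReLU) and a plain `add_` as a stand-in for whatever in-place update follows. It does not reproduce the repository's actual RESAReducer.

```python
import torch
import torch.nn as nn


class ToyReducer(nn.Module):
    """Hypothetical conv -> BN -> (ReLU) block; not the repository's code."""

    def __init__(self, use_relu):
        super().__init__()
        self.conv = nn.Conv2d(8, 4, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(4)
        self.use_relu = use_relu

    def forward(self, x):
        x = self.bn(self.conv(x))
        return torch.relu(x) if self.use_relu else x


def backward_ok(use_relu):
    """Report whether backward succeeds after an in-place feature update."""
    torch.manual_seed(0)
    feats = ToyReducer(use_relu)(torch.randn(2, 8, 6, 6))
    feats.add_(1.0)  # stand-in for a later in-place update of the features
    try:
        feats.sum().backward()
        return True
    except RuntimeError:
        return False


print("with ReLU:", backward_ok(use_relu=True))
print("without ReLU:", backward_ok(use_relu=False))
```

Under these assumptions the ReLU variant fails in backward and the variant without ReLU does not, which matches the behaviour described above. Removing the activation changes the model, though, so it is a diagnostic rather than a fix.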

voldemortX commented 2 years ago

What pytorch version are you using & do you experience this with/without mixed precision?

solidexu commented 2 years ago

I use torch 1.10.2. I have tested with and without mixed precision; same issue.

voldemortX commented 2 years ago

@solidexu I don't really have 1.10, but training starts normally for me with 1.6.0 (I have only one card, so I first changed world_size to 1 and then used a batch size of only 2).

Here is my command:

python main_landet.py --train --config=./configs/lane_detection/resa/resnet50_culane.py --mixed-precision --batch-size=2

voldemortX commented 2 years ago

Are you running customized code or do you see that error in the current master branch?

solidexu commented 2 years ago

I'm on the current master branch; in fact, I downloaded your new branch three days ago. Commenting out the relu also causes another error during training. I think I can try torch 1.6.0.

solidexu commented 2 years ago

Try adding a 1x1 conv at the top layer of RESA; it may help.
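A rough sketch of why a 1x1 conv in front of the aggregation could help, assuming the aggregation updates its input in place: the conv produces a fresh tensor, so the in-place updates no longer touch the ReLU output that autograd saved. A plain `.clone()` has the same effect without adding parameters. The names `shift_aggregate` and `run` are made up for illustration and the loop is only a toy, not the repository's RESA implementation. The thread does not say which change #121 actually made.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def shift_aggregate(x, steps=(1, 2, 4), alpha=0.5):
    """Toy row-shift aggregation that updates x in place (hypothetical)."""
    h = x.shape[-2]
    for step in steps:
        idx = (torch.arange(h) + step) % h      # cyclic row shift
        x.add_(alpha * F.relu(x[..., idx, :]))  # in-place feature update
    return x


def run(variant):
    """Return True if backward succeeds for the given variant."""
    torch.manual_seed(0)
    inp = torch.randn(1, 4, 8, 5, requires_grad=True)
    feats = torch.relu(inp)  # output saved by autograd for ReLU's backward
    if variant == "conv1x1":
        feats = nn.Conv2d(4, 4, kernel_size=1)(feats)  # fresh tensor
    elif variant == "clone":
        feats = feats.clone()                          # fresh tensor
    try:
        shift_aggregate(feats).sum().backward()
        return True
    except RuntimeError:
        return False


for name in ("raw", "conv1x1", "clone"):
    print(name, run(name))
```

In this toy setting only the "raw" variant fails. Replacing `x.add_(...)` with the out-of-place `x = x + ...` would be another option, at the cost of extra memory per step.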

voldemortX commented 1 year ago

@solidexu Sorry to disturb you, but did you solve this issue by downgrading PyTorch? I think others have encountered it as well.

voldemortX commented 1 year ago

Closed by #121.