solidexu closed this issue 1 year ago.
@solidexu I don't have spare gpu right now. I will try test it tomorrow.
I don't know why, but the issue is solved by commenting out the ReLU in RESAReducer. It's very strange to me.
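The observation above has a plausible mechanism. Here is a minimal sketch (toy tensors, not the actual RESA code): ReLU's backward pass reuses its *output* as the gradient mask, so any later in-place write to that output bumps its version counter and invalidates the graph, which is exactly the `ReluBackward0` error in the traceback. Removing the ReLU removes the saved tensor, so the error disappears.

```python
import torch
import torch.nn as nn

relu = nn.ReLU()

# Reproduce: modify the ReLU's output in place before backward.
x = torch.randn(1, 4, requires_grad=True)
y = relu(x)
y += 1.0                      # in-place write to the ReLU's saved output
try:
    y.sum().backward()
    failed = False
except RuntimeError:          # "... modified by an inplace operation ..."
    failed = True

# Fix without touching the ReLU: clone before the in-place update, so the
# saved output itself is never mutated.
x2 = torch.randn(1, 4, requires_grad=True)
y2 = relu(x2).clone()
y2 += 1.0
y2.sum().backward()           # succeeds; gradient is the usual ReLU mask
```

The same pattern applies inside a model: any in-place update (`+=`, `mul_`, index assignment) applied downstream of a ReLU output can trigger this.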
What PyTorch version are you using, and do you experience this with and without mixed precision?
I'm using torch 1.10.2. I've tested both with and without mixed precision; same issue.
@solidexu I don't have 1.10, but I can start training normally with 1.6.0 (I have only one card, so I first changed world_size to 1 and then used only batch size 2).
Here is my command:
python main_landet.py --train --config=./configs/lane_detection/resa/resnet50_culane.py --mixed-precision --batch-size=2
Are you running customized code or do you see that error in the current master branch?
It's the current master branch; in fact I downloaded your new branch three days ago. Commenting out the ReLU also causes another error during training. I think I'll try torch 1.6.0.
Try adding a 1×1 conv at the top layer of RESA; it may be helpful.
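A sketch of why that suggestion can work (names and shapes below are hypothetical, not from the repo): a 1×1 conv placed between the reducer's ReLU and RESA's message passing produces a fresh tensor, so RESA's in-place feature updates no longer touch the tensor the ReLU saved for backward.

```python
import torch
import torch.nn as nn

channels = 128                                  # matches [5, 128, 36, 100]
relu = nn.ReLU()
bridge = nn.Conv2d(channels, channels, kernel_size=1)  # the suggested 1x1 conv

x = torch.randn(2, channels, 36, 100, requires_grad=True)
feat = bridge(relu(x))        # new tensor; the ReLU's output stays untouched
feat += 1.0                   # stand-in for RESA's in-place aggregation
feat.sum().backward()         # no inplace-modification error
```

Conv2d's backward saves its input and weight but not its output, so in-place updates to `feat` are safe.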
@solidexu Sorry to disturb, but did you solve this issue by downgrading PyTorch? I think others have encountered it as well.
Closed by #121.
When I tested RESA with ResNet-50, an error occurred. Then I tested SCNN with ResNet-50; same issue.
python main_landet.py --train --config=./configs/lane_detection/resa/resnet50_culane.py --mixed-precision

Loaded torchvision ImageNet pre-trained weights V1.
Not using distributed mode
cuda
Traceback (most recent call last):
  File "main_landet.py", line 65, in <module>
    runner.run()
  File "/home/aaa/pytorch-auto-drive-master/utils/runners/lane_det_trainer.py", line 55, in run
    scaler.scale(loss).backward()
  File "/home/aaa/.local/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/aaa/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [5, 128, 36, 100]], which is output 0 of ReluBackward0, is at version 20; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
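Following the hint at the end of the error message, anomaly detection can be enabled to locate the offending op. A toy sketch (not the real model) of how it changes the diagnostics: the RuntimeError's message then includes a traceback of the forward-pass operation whose output was later modified in place.

```python
import torch

caught = False
with torch.autograd.detect_anomaly():
    x = torch.randn(3, requires_grad=True)
    y = torch.relu(x)
    y.mul_(2)                 # the offending in-place op
    try:
        y.sum().backward()
    except RuntimeError:      # error now points at the forward-pass op
        caught = True
```

In the trainer this would mean wrapping the forward and `scaler.scale(loss).backward()` calls in the same context manager while debugging, then removing it, since anomaly detection slows training considerably.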