StephanPan opened this issue 3 years ago
Hi @StephanPan, have you solved the issue? I think the problem is this line, but I don't know how to rewrite it.
@axhiao I changed the loss calculation in function.py as follows, and it worked, but I do not know whether it will influence the model performance.

```python
optimizer.zero_grad()
if loss_cord > 0:
    (loss_2d + loss_cord).backward()
if loss_3d > 0 and (i + 1) % accumulation_steps == 0:
    loss_3d.backward()
optimizer.step()
```
Hi @StephanPan, I think it's due to a different PyTorch version. I recommend using the requirements.txt to create a fresh virtual Python env to run this code.
@axhiao that's right, but my CUDA version and GPU driver are not compatible with torch 1.4.
I'm running into the same error too...
@StephanPan hi, you are right, the problem is in the backward step. You can change the code in function.py as follows:
```python
loss = loss_2d + loss_3d + loss_cord
loss.backward()
if (i + 1) % accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()
```
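For anyone who wants to see that pattern outside the repo, here is a minimal, self-contained sketch (a toy model and dummy data; `ToyNet`, `criterion`, and the shapes are illustrative assumptions, not names from this codebase): sum all loss terms, call backward() once per batch, and only step and zero the optimizer every `accumulation_steps` batches.

```python
# Minimal sketch (not the repo's code): toy model trained with a combined loss
# plus gradient accumulation, mirroring the fix above.
import torch
import torch.nn as nn

accumulation_steps = 4

class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.head_2d = nn.Linear(8, 2)
        self.head_3d = nn.Linear(8, 3)

    def forward(self, x):
        feat = self.backbone(x)
        return self.head_2d(feat), self.head_3d(feat)

model = ToyNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

optimizer.zero_grad()
for i in range(16):
    x = torch.randn(4, 8)
    target_2d, target_3d = torch.randn(4, 2), torch.randn(4, 3)
    pred_2d, pred_3d = model(x)

    # Sum all loss terms first, then call backward() exactly once per batch.
    # (Optionally divide the loss by accumulation_steps to keep the effective
    # gradient scale comparable to per-batch updates.)
    loss = criterion(pred_2d, target_2d) + criterion(pred_3d, target_3d)
    loss.backward()

    # Step and zero the optimizer only every `accumulation_steps` batches.
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Calling backward() once per batch means no graph has to be reused after optimizer.step() has modified the parameters in place, which is what the original two-stage backward runs into.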
@StephanPan Hi, do you know what loss_cord is?
@StephanPan, @wkom hi, you are right, the problem is in the backward step. You can change the code in function.py as follows:
```python
loss = loss_2d + loss_3d + loss_cord
loss.backward()
if (i + 1) % accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()
```
How exactly do you change the code ???
```python
loss = loss_2d + loss_3d + loss_cord
losses.update(loss.item())
if loss_cord > 0:
    optimizer.zero_grad()
    (loss_2d + loss_cord).backward()
    optimizer.step()
if accu_loss_3d > 0 and (i + 1) % accumulation_steps == 0:
    optimizer.zero_grad()
    accu_loss_3d.backward()
    optimizer.step()
    accu_loss_3d = 0.0
else:
    accu_loss_3d += loss_3d / accumulation_steps
```
This is how I changed it, it works for me:
```python
loss_2d = loss_2d.mean()
loss_3d = loss_3d.mean()
loss_cord = loss_cord.mean()
losses_2d.update(loss_2d.item())
losses_3d.update(loss_3d.item())
losses_cord.update(loss_cord.item())
loss = loss_2d + loss_3d + loss_cord
losses.update(loss.item())
loss.backward()
if (i + 1) % accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()
# if loss_cord > 0:
#     optimizer.zero_grad()
#     (loss_2d + loss_cord).backward()
#     optimizer.step()
# if accu_loss_3d > 0 and (i + 1) % accumulation_steps == 0
#     optimizer.step()
#     optimizer.zero_grad()
#     accu_loss_3d.backward()
#     accu_loss_3d = 0.0
# else:
#     accu_loss_3d += loss_3d / accumulation_steps
batch_time.update(time.time() - end)
end = time.time()
```
Try changing the torch version to 1.4; it should be OK. :)
The change also works for me, but I don't know whether it will affect the precision of the result. Can you give some explanation? Thanks!
same question
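Not an official explanation, but a quick sanity check you can run yourself (a toy linear model; every name below is illustrative, nothing is from this repo): because backward() accumulates gradients linearly, backpropagating the summed loss gives the same parameter gradients as backpropagating each term separately, so summing the terms before backward() should not by itself change the optimization. What does differ from the original code is the update schedule: every loss term now follows the `accumulation_steps` interval instead of only the 3D term.

```python
# Toy check: gradients of a summed loss equal the sum of per-term gradients.
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x = torch.randn(5, 3)
t1, t2 = torch.randn(5), torch.randn(5)

# Backprop each loss term separately; grads accumulate into w.grad.
loss_a = ((x @ w) - t1).pow(2).mean()
loss_b = ((x @ w) - t2).pow(2).mean()
loss_a.backward()
loss_b.backward()
grad_separate = w.grad.clone()

# Backprop the summed loss once.
w.grad = None
loss = ((x @ w) - t1).pow(2).mean() + ((x @ w) - t2).pow(2).mean()
loss.backward()
grad_summed = w.grad.clone()

print(torch.allclose(grad_separate, grad_summed))  # True (up to float error)
```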
I ran into this problem when training the model on the Campus dataset, using torch 1.7 and CUDA 11.1. Also, the training strategy in the code seems to be different from the strategy given in the paper.

```
Traceback (most recent call last):
  File "run/train_3d.py", line 163, in <module>
    main()
  File "run/train_3d.py", line 136, in main
    train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
  File "/home/gw/Project/voxelpose/lib/core/function.py", line 68, in train_3d
    accu_loss_3d.backward()
  File "/home/gw/anaconda3/envs/VIBE/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/gw/anaconda3/envs/VIBE/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 32, 1, 1, 1]] is at version 8; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
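If it helps to see the failure in isolation, an error of this kind can be reproduced with a few lines (a toy parameter and SGD; nothing here is from the repo): a loss whose backward pass is delayed keeps references to the parameter values from when its graph was built, and optimizer.step() updates those parameters in place in the meantime, so the later backward() finds the saved tensors at a newer version than expected.

```python
# Minimal sketch (toy example, illustrative only) of the in-place error:
# optimizer.step() modifies parameters in place between graph construction
# and the delayed backward() through the accumulated loss.
import torch

w = torch.nn.Parameter(torch.randn(3))
optimizer = torch.optim.SGD([w], lr=0.1)

# Build a graph now, but delay its backward pass (like accu_loss_3d).
accumulated = (w ** 2).sum()   # pow saves w for its backward pass

# A separate loss is backpropagated and the optimizer steps in between.
loss_now = (w * 2.0).sum()
loss_now.backward()
optimizer.step()               # SGD updates w in place -> w's version changes

# The saved w inside `accumulated`'s graph is now stale:
# RuntimeError: one of the variables needed for gradient computation has been
# modified by an inplace operation ...
accumulated.backward()
```

That matches the fix in this thread: calling backward() on all loss terms before optimizer.step() means no graph has to survive an in-place parameter update.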