StephanPan opened this issue 3 years ago
Hi @StephanPan, have you solved the issue? I think the problem is this line, but I don't know how to rewrite it.
@axhiao I changed the loss calculation in function.py as follows, and it worked, but I do not know whether it will influence the model performance.

```python
optimizer.zero_grad()
if loss_cord > 0:
    (loss_2d + loss_cord).backward()
if loss_3d > 0 and (i + 1) % accumulation_steps == 0:
    loss_3d.backward()
optimizer.step()
```
Hi @StephanPan, I think it's due to a different PyTorch version. I recommend using the requirements.txt to create a fresh virtual Python env to run this code.
@axhiao that's right, but my CUDA version and GPU driver are not compatible with torch 1.4.
I'm running into the same error too...
@StephanPan hi, you are right, the problem is in the backward step. You can change the code in function.py as follows:
```python
loss = loss_2d + loss_3d + loss_cord
loss.backward()
if (i + 1) % accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()
```
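For anyone who wants to see that pattern outside the repo, here is a minimal, self-contained sketch (a toy model and dummy data; `ToyNet`, `criterion`, and the shapes are illustrative assumptions, not names from this codebase): sum all loss terms, call backward() once per batch, and only step and zero the optimizer every `accumulation_steps` batches.

```python
# Minimal sketch (not the repo's code): toy model trained with a combined loss
# plus gradient accumulation, mirroring the fix above.
import torch
import torch.nn as nn

accumulation_steps = 4

class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.head_2d = nn.Linear(8, 2)
        self.head_3d = nn.Linear(8, 3)

    def forward(self, x):
        feat = self.backbone(x)
        return self.head_2d(feat), self.head_3d(feat)

model = ToyNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

optimizer.zero_grad()
for i in range(16):
    x = torch.randn(4, 8)
    target_2d, target_3d = torch.randn(4, 2), torch.randn(4, 3)
    pred_2d, pred_3d = model(x)

    # Sum all loss terms first, then call backward() exactly once per batch.
    # (Optionally divide the loss by accumulation_steps to keep the effective
    # gradient scale comparable to per-batch updates.)
    loss = criterion(pred_2d, target_2d) + criterion(pred_3d, target_3d)
    loss.backward()

    # Step and zero the optimizer only every `accumulation_steps` batches.
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Calling backward() once per batch means no graph has to be reused after optimizer.step() has modified the parameters in place, which is what the original two-stage backward runs into.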
@StephanPan Hi, do you know what loss_cord is?
@StephanPan, @wkom hi, you are right, the problem is in the backward step. You can change the code in function.py as follows:
```python
loss = loss_2d + loss_3d + loss_cord
loss.backward()
if (i + 1) % accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()
```
How exactly do you change the code ???
```python
loss = loss_2d + loss_3d + loss_cord
losses.update(loss.item())
if loss_cord > 0:
    optimizer.zero_grad()
    (loss_2d + loss_cord).backward()
    optimizer.step()
if accu_loss_3d > 0 and (i + 1) % accumulation_steps == 0:
    optimizer.zero_grad()
    accu_loss_3d.backward()
    optimizer.step()
    accu_loss_3d = 0.0
else:
    accu_loss_3d += loss_3d / accumulation_steps
```
This is how I changed it, it works for me:
```python
loss_2d = loss_2d.mean()
loss_3d = loss_3d.mean()
loss_cord = loss_cord.mean()
losses_2d.update(loss_2d.item())
losses_3d.update(loss_3d.item())
losses_cord.update(loss_cord.item())
loss = loss_2d + loss_3d + loss_cord
losses.update(loss.item())
loss.backward()
if (i + 1) % accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()
# if loss_cord > 0:
#     optimizer.zero_grad()
#     (loss_2d + loss_cord).backward()
#     optimizer.step()
# if accu_loss_3d > 0 and (i + 1) % accumulation_steps == 0
#     optimizer.step()
#     optimizer.zero_grad()
#     accu_loss_3d.backward()
#     accu_loss_3d = 0.0
# else:
#     accu_loss_3d += loss_3d / accumulation_steps
batch_time.update(time.time() - end)
end = time.time()
```
Try changing the torch version to 1.4; it should be OK. :)
The change also works for me, but I don't know whether it will affect the precision of the result. Can you give some explanation? Thanks!
same question
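Not an official explanation, but a quick sanity check you can run yourself (a toy linear model; every name below is illustrative, nothing is from this repo): because backward() accumulates gradients linearly, backpropagating the summed loss gives the same parameter gradients as backpropagating each term separately, so summing the terms before backward() should not by itself change the optimization. What does differ from the original code is the update schedule: every loss term now follows the `accumulation_steps` interval instead of only the 3D term.

```python
# Toy check: gradients of a summed loss equal the sum of per-term gradients.
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x = torch.randn(5, 3)
t1, t2 = torch.randn(5), torch.randn(5)

# Backprop each loss term separately; grads accumulate into w.grad.
loss_a = ((x @ w) - t1).pow(2).mean()
loss_b = ((x @ w) - t2).pow(2).mean()
loss_a.backward()
loss_b.backward()
grad_separate = w.grad.clone()

# Backprop the summed loss once.
w.grad = None
loss = ((x @ w) - t1).pow(2).mean() + ((x @ w) - t2).pow(2).mean()
loss.backward()
grad_summed = w.grad.clone()

print(torch.allclose(grad_separate, grad_summed))  # True (up to float error)
```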
I ran into this problem when training the model on the Campus dataset, using torch 1.7 and CUDA 11.1. Also, the training strategy in the code seems to be different from the strategy given in the paper.

```
Traceback (most recent call last):
  File "run/train_3d.py", line 163, in <module>
    main()
  File "run/train_3d.py", line 136, in main
    train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
  File "/home/gw/Project/voxelpose/lib/core/function.py", line 68, in train_3d
    accu_loss_3d.backward()
  File "/home/gw/anaconda3/envs/VIBE/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/gw/anaconda3/envs/VIBE/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 32, 1, 1, 1]] is at version 8; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
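If it helps to see the failure in isolation, an error of this kind can be reproduced with a few lines (a toy parameter and SGD; nothing here is from the repo): a loss whose backward pass is delayed keeps references to the parameter values from when its graph was built, and optimizer.step() updates those parameters in place in the meantime, so the later backward() finds the saved tensors at a newer version than expected.

```python
# Minimal sketch (toy example, illustrative only) of the in-place error:
# optimizer.step() modifies parameters in place between graph construction
# and the delayed backward() through the accumulated loss.
import torch

w = torch.nn.Parameter(torch.randn(3))
optimizer = torch.optim.SGD([w], lr=0.1)

# Build a graph now, but delay its backward pass (like accu_loss_3d).
accumulated = (w ** 2).sum()   # pow saves w for its backward pass

# A separate loss is backpropagated and the optimizer steps in between.
loss_now = (w * 2.0).sum()
loss_now.backward()
optimizer.step()               # SGD updates w in place -> w's version changes

# The saved w inside `accumulated`'s graph is now stale:
# RuntimeError: one of the variables needed for gradient computation has been
# modified by an inplace operation ...
accumulated.backward()
```

That matches the fix in this thread: calling backward() on all loss terms before optimizer.step() means no graph has to survive an in-place parameter update.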