huangmozhilv / u2net_torch

MICCAI 2019: 3D U$^2$-Net: A 3D Universal U-Net for Multi-Domain Medical Image Segmentation

A small bug when using resume_ckp and continuing training. #8

Closed mangoyuan closed 4 years ago

mangoyuan commented 4 years ago

I previously stopped training when the lr fell below 3e-7. I then set min_lr to 1e-8 and used resume_ckp and resume_epoch to continue training, but I found that the lr is not resumed correctly at https://github.com/huangmozhilv/u2net_torch/blob/a1e43b85a2c7bc4a468f0eccc0403b87b08e9a2e/u2net_torch_src/train.py#L214

lr = config.base_lr
if args.resume_ckp != '':
    optimizer = checkpoint['optimizer']
else:
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=config.weight_decay) # 

This can be fixed like this:

lr = config.base_lr
if args.resume_ckp != '':
    optimizer = checkpoint['optimizer']
    lr = optimizer.param_groups[0]['lr']   # resume the original learning rate
else:
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=config.weight_decay) # 
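
For reference, the more common PyTorch pattern is to checkpoint optimizer.state_dict() and restore it with load_state_dict(), which also brings back each param group's current learning rate; pickling the whole optimizer object ties the checkpoint to the exact code version. Below is a minimal, self-contained sketch of that pattern; the placeholder model, file name, and checkpoint keys are my own assumptions, not this repository's code.

import torch
import torch.nn as nn

# placeholder model for illustration only (not the repo's network)
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

# save: store state_dicts rather than the optimizer object itself
torch.save({'model_state': model.state_dict(),
            'optimizer_state': optimizer.state_dict()}, 'ckpt.pth')

# resume: rebuild the optimizer, then load the saved state,
# which restores the per-group learning rate automatically
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
state = torch.load('ckpt.pth')
model.load_state_dict(state['model_state'])
optimizer.load_state_dict(state['optimizer_state'])
lr = optimizer.param_groups[0]['lr']  # learning rate carried over from the checkpoint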

Finally, thanks for sharing your work!

huangmozhilv commented 4 years ago

@mangoyuan Please try to avoid resuming from a checkpoint and instead run the code to completion each time. This is indeed a bug: trLoss is computed as an exponential moving average over the past several epochs, which makes it difficult to restore the trLoss state from a checkpoint. Due to limited time, I did not debug this part. To fix it, you would also need to save the per-epoch values of the last window in the checkpoint.
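
For anyone who wants to attempt a fix, one possible direction (a sketch under my own assumptions about names, window size, and smoothing factor, not the repository's actual trLoss code) is to store the smoothed loss and the raw losses of the last window in the checkpoint, and restore them before resuming training:

from collections import deque
import torch

WINDOW = 5                              # assumed window size
EMA_ALPHA = 0.9                         # assumed smoothing factor
recent_losses = deque(maxlen=WINDOW)    # raw epoch losses in the last window
tr_loss_ema = None                      # exponentially smoothed training loss

def update_tr_loss(epoch_loss):
    """Update the smoothed training loss after each epoch."""
    global tr_loss_ema
    recent_losses.append(epoch_loss)
    if tr_loss_ema is None:
        tr_loss_ema = epoch_loss
    else:
        tr_loss_ema = EMA_ALPHA * tr_loss_ema + (1 - EMA_ALPHA) * epoch_loss
    return tr_loss_ema

# when saving a checkpoint, include the smoothing state
torch.save({'tr_loss_ema': tr_loss_ema,
            'recent_losses': list(recent_losses)}, 'ckpt.pth')

# when resuming, restore it before continuing training
state = torch.load('ckpt.pth')
tr_loss_ema = state['tr_loss_ema']
recent_losses = deque(state['recent_losses'], maxlen=WINDOW)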

mangoyuan commented 4 years ago

OK, I get it.