huangmozhilv / u2net_torch

MICCAI 2019: 3D U$^2$-Net: A 3D Universal U-Net for Multi-Domain Medical Image Segmentation

A small bug when using resume_ckp and continuing training. #8

Closed mangoyuan closed 4 years ago

mangoyuan commented 4 years ago

I previously stopped training when the lr fell below 3e-7. I then set min_lr to 1e-8 and used resume_ckp and resume_epoch to continue training, but I found that the lr is not resumed correctly at https://github.com/huangmozhilv/u2net_torch/blob/a1e43b85a2c7bc4a468f0eccc0403b87b08e9a2e/u2net_torch_src/train.py#L214

lr = config.base_lr
if args.resume_ckp != '':
    optimizer = checkpoint['optimizer']
else:
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=config.weight_decay) # 

This can be fixed like this:

lr = config.base_lr
if args.resume_ckp != '':
    optimizer = checkpoint['optimizer']
    lr = optimizer.param_groups[0]['lr']   # resume the original learning rate
else:
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=config.weight_decay) # 
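
For reference, the more common PyTorch pattern is to checkpoint optimizer.state_dict() and restore it with load_state_dict(), which also brings back each param group's current learning rate; pickling the whole optimizer object ties the checkpoint to the exact code version. Below is a minimal, self-contained sketch of that pattern; the placeholder model, file name, and checkpoint keys are my own assumptions, not this repository's code.

import torch
import torch.nn as nn

# placeholder model for illustration only (not the repo's network)
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

# save: store state_dicts rather than the optimizer object itself
torch.save({'model_state': model.state_dict(),
            'optimizer_state': optimizer.state_dict()}, 'ckpt.pth')

# resume: rebuild the optimizer, then load the saved state,
# which restores the per-group learning rate automatically
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
state = torch.load('ckpt.pth')
model.load_state_dict(state['model_state'])
optimizer.load_state_dict(state['optimizer_state'])
lr = optimizer.param_groups[0]['lr']  # learning rate carried over from the checkpoint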

Finally, thanks for sharing your work!

huangmozhilv commented 4 years ago

@mangoyuan Please try to avoid resuming from a checkpoint and instead run the code to completion each time. This is indeed a bug: trLoss is computed as an exponential moving average over the past several epochs, which makes it difficult to restore the trLoss state from a checkpoint. Due to limited time, I did not debug this part. To fix it, you would also need to save the per-epoch values of the last window in the checkpoint.
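
For anyone who wants to attempt a fix, one possible direction (a sketch under my own assumptions about names, window size, and smoothing factor, not the repository's actual trLoss code) is to store the smoothed loss and the raw losses of the last window in the checkpoint, and restore them before resuming training:

from collections import deque
import torch

WINDOW = 5                              # assumed window size
EMA_ALPHA = 0.9                         # assumed smoothing factor
recent_losses = deque(maxlen=WINDOW)    # raw epoch losses in the last window
tr_loss_ema = None                      # exponentially smoothed training loss

def update_tr_loss(epoch_loss):
    """Update the smoothed training loss after each epoch."""
    global tr_loss_ema
    recent_losses.append(epoch_loss)
    if tr_loss_ema is None:
        tr_loss_ema = epoch_loss
    else:
        tr_loss_ema = EMA_ALPHA * tr_loss_ema + (1 - EMA_ALPHA) * epoch_loss
    return tr_loss_ema

# when saving a checkpoint, include the smoothing state
torch.save({'tr_loss_ema': tr_loss_ema,
            'recent_losses': list(recent_losses)}, 'ckpt.pth')

# when resuming, restore it before continuing training
state = torch.load('ckpt.pth')
tr_loss_ema = state['tr_loss_ema']
recent_losses = deque(state['recent_losses'], maxlen=WINDOW)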

mangoyuan commented 4 years ago

OK, I get it.