mangoyuan closed this issue 4 years ago
@mangoyuan Please try to avoid resuming any model, and run the code to completion every time. Resuming is buggy because trLoss is computed as an exponential moving average over the past several epochs, which makes it difficult to restore the trLoss state from a checkpoint. Because of limited time, I did not debug this part. To fix it, you would also need to save the values for the past epochs in the last window.
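As a rough illustration of saving the last window alongside the checkpoint, here is a minimal sketch in plain Python. It assumes the smoothed loss is a mean over a fixed window of recent epoch losses; the repository's actual smoothing may differ. The class `WindowedLoss` and the checkpoint key `trLoss_state` are hypothetical names, not part of u2net_torch.

```python
from collections import deque


class WindowedLoss:
    """Tracks a smoothed training loss over the last `window` epochs (hypothetical helper)."""

    def __init__(self, window=5):
        # deque with maxlen drops the oldest value automatically
        self.values = deque(maxlen=window)

    def update(self, loss):
        """Record one epoch loss and return the mean over the current window."""
        self.values.append(float(loss))
        return sum(self.values) / len(self.values)

    def state_dict(self):
        """Return a plain dict that can be stored inside a checkpoint."""
        return {"window": self.values.maxlen, "values": list(self.values)}

    def load_state_dict(self, state):
        """Rebuild the window from a dict produced by state_dict()."""
        self.values = deque(state["values"], maxlen=state["window"])


# Training run that gets interrupted after four epochs
tracker = WindowedLoss(window=3)
for loss in [0.9, 0.7, 0.6, 0.5]:
    tracker.update(loss)

# "trLoss_state" is an assumed key; store it next to the model weights
ckp = {"trLoss_state": tracker.state_dict()}

# Resumed run: restore the window before the next epoch
resumed = WindowedLoss()
resumed.load_state_dict(ckp["trLoss_state"])
```

With the window restored, the first smoothed loss after resuming matches what an uninterrupted run would have produced.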
OK, I get it.
I previously stopped training the model when `lr` became smaller than 3e-7. I then set `min_lr` to 1e-8 and used `resume_ckp` and `resume_epoch` to continue training. However, I found that `lr` is resumed incorrectly at https://github.com/huangmozhilv/u2net_torch/blob/a1e43b85a2c7bc4a468f0eccc0403b87b08e9a2e/u2net_torch_src/train.py#L214. This can be fixed as follows:
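A minimal sketch of one possible approach, not the snippet originally posted: store the learning rate of each parameter group in the checkpoint and write it back on resume. It relies only on the `param_groups` list of dicts that PyTorch optimizers expose; the helpers `save_lr` and `restore_lr`, the checkpoint key `lr`, and the `SimpleNamespace` stand-in for the optimizer are assumptions made for illustration.

```python
from types import SimpleNamespace


def save_lr(optimizer):
    """Collect the current learning rate of every parameter group."""
    return [group["lr"] for group in optimizer.param_groups]


def restore_lr(optimizer, saved_lrs, min_lr):
    """Write saved learning rates back, never going below min_lr."""
    if len(saved_lrs) != len(optimizer.param_groups):
        raise ValueError("checkpoint and optimizer have different group counts")
    for group, lr in zip(optimizer.param_groups, saved_lrs):
        group["lr"] = max(lr, min_lr)


# Stand-in object with the same param_groups layout as a torch.optim optimizer
old_opt = SimpleNamespace(param_groups=[{"lr": 2.5e-7}])
ckp = {"lr": save_lr(old_opt)}  # "lr" is an assumed checkpoint key

# Resumed run: the optimizer starts at its default lr, then gets the saved one
new_opt = SimpleNamespace(param_groups=[{"lr": 1e-3}])
restore_lr(new_opt, ckp["lr"], min_lr=1e-8)
```

If the checkpoint already contains the optimizer's own `state_dict()`, loading it with `optimizer.load_state_dict(...)` restores the learning rate as well, so the separate key would not be needed.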
Finally, thanks for sharing!