mogvision / ADL

MIT License

Problem with resuming training #8

Open shengyenlin opened 1 year ago

shengyenlin commented 1 year ago

Hi,

Thanks for your amazing work.

I have a question about loading the model from a checkpoint.

It seems that the code doesn't load the model correctly from some checkpoints, and training always starts from scratch.

For example, yesterday I trained the model for six hours (29,200 iterations completed), and today I restarted training with the same training config.

The output states that the model was correctly restored from step = 29,200, but the first iteration of this second run is still reported as step: <<<<< 100/108650 >>>>>, and the validation and test PSNR are not at the level reached at 29,200 iterations either.

Is there anything I have to modify for checkpoint loading (e.g. in train_json?), or am I missing something here?
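
To make the behaviour I expect concrete, here is a minimal, framework-agnostic sketch of a resume that restores the step counter along with the weights. This is not ADL's code: `save_checkpoint`, `load_checkpoint`, `train`, the JSON file format, and `ckpt_path` are all hypothetical names chosen for illustration, and the "weights" are a toy list of floats.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, weights):
    """Write the step counter together with the weights (hypothetical format)."""
    with open(path, "w") as f:
        json.dump({"step": step, "weights": weights}, f)


def load_checkpoint(path):
    """Return (step, weights); fall back to a fresh start if no file exists."""
    if not os.path.exists(path):
        return 0, [0.0]
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["weights"]


def train(path, total_steps, save_every=100):
    """Toy loop that resumes from the stored step instead of restarting at 0."""
    step, weights = load_checkpoint(path)
    first_step = step + 1  # first iteration actually executed in this run
    while step < total_steps:
        step += 1
        weights = [w + 0.5 for w in weights]  # stand-in for a real update
        if step % save_every == 0:
            save_checkpoint(path, step, weights)
    return first_step, step, weights


ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
print(train(ckpt_path, total_steps=200))  # fresh run: begins at step 1
print(train(ckpt_path, total_steps=300))  # resumed run: begins at step 201
```

In this sketch the second call continues from step 201 with the weights left by the first call. In my case the log instead restarts at step 100, which suggests that either the step counter or the weights (or both) are not being restored.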

Thanks in advance.