Hi,
Thanks for your amazing work.
I have a question about loading a model from a checkpoint.
It seems that the code fails to load the model from some checkpoints and always starts training from scratch.
For example, yesterday I trained the model for six hours (29,200 iterations completed), and today I started training again with the same training config.
The output states that the model was correctly restored from step = 29,200, but the first iteration of this second run is still step: <<<<< 100/108650 >>>>>, and the validation and test PSNR are not at the level reached after 29,200 iterations either.
Is there anything I need to modify for checkpoint loading (e.g. in train_json?), or am I missing something?
Thanks in advance.