valeoai / Maskgit-pytorch

unofficial MaskGIT reproduction in PyTorch
MIT License

Training loss jumping up when resuming training. #15

Closed leon-herbrik closed 4 months ago

leon-herbrik commented 5 months ago

Hey there,

when I load the model and optimizer state dicts from a checkpoint and resume training, the training loss suddenly spikes, undoing much of the progress from the previous run. It eventually comes back down, but training is set back by a large margin.
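For reference, my resume logic looks roughly like this (a minimal sketch with placeholder function and key names, not the repo's actual trainer): the optimizer state, LR scheduler, and AMP scaler are saved and restored alongside the model weights, so the Adam moments should be intact.

```python
import torch

# Sketch only: `model`, `optimizer`, `scheduler`, and `scaler` are whatever
# objects the training script builds; the checkpoint key names are arbitrary.

def save_checkpoint(path, model, optimizer, scheduler=None, scaler=None, step=0):
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),  # Adam first/second moments live here
        "scheduler": scheduler.state_dict() if scheduler is not None else None,
        "scaler": scaler.state_dict() if scaler is not None else None,  # AMP loss scale
        "step": step,
    }
    torch.save(state, path)

def load_checkpoint(path, model, optimizer, scheduler=None, scaler=None, device="cuda"):
    state = torch.load(path, map_location=device)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    if scheduler is not None and state.get("scheduler") is not None:
        scheduler.load_state_dict(state["scheduler"])
    if scaler is not None and state.get("scaler") is not None:
        scaler.load_state_dict(state["scaler"])
    return state.get("step", 0)  # resume the step counter as well
```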

Would you, by chance, know what causes this behavior?

Thanks a lot in advance!

llvictorll commented 4 months ago

Hello,

I haven't noticed this in my previous experiments, but maybe lowering the learning rate, or using a learning-rate warm-up for the first 1000–5000 steps, could help stabilize the fine-tuning?
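Something along these lines, for example (a rough sketch with illustrative values, not the repo's defaults; the toy model and data are placeholders):

```python
import torch
import torch.nn as nn

# Minimal linear warm-up sketch: the LR ramps from ~0 to base_lr over the
# first `warmup_steps` optimizer updates, then stays at base_lr.
model = nn.Linear(16, 16)                      # placeholder model
data = [torch.randn(8, 16) for _ in range(10)]  # placeholder batches

warmup_steps = 2000   # e.g. somewhere in the 1000-5000 range suggested above
base_lr = 1e-4

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

for step, batch in enumerate(data):
    loss = nn.functional.mse_loss(model(batch), batch)  # stand-in for the real loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the warm-up once per optimizer step
```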

Best,

Victor

leon-herbrik commented 4 months ago

I'll try that, thanks for the idea! I am currently fine-tuning the ImageNet base model on the Ego4D dataset, and the qualitative results already look quite good.

llvictorll commented 4 months ago

Good luck with that! The domain gap could also explain the loss jump during the first steps, since ImageNet and Ego4D are quite different.