nan loss in VICEReg

rzamarefat commented 1 year ago

Hi. Thank you for this awesome repo. I followed the example code (https://github.com/lightly-ai/lightly/blob/master/examples/pytorch/vicreg.py), changing the backbone to ResNet152 and the input_size in ImageCollateFunction to 224 to suit ResNet152. The projection head dimensions are also set to 2048 to match. The problem is that after a few steps in the training loop the loss becomes so large that it returns inf, and nan afterwards. Any suggestions for solving this would be appreciated. (Attached: Screenshot from 2023-01-10 04-06-08)

guarin commented 1 year ago

Hi, thanks a lot for the issue report!

We noticed that the VicReg loss is quite sensitive to training parameters and the optimizer. To make training more stable you can do the following:

- Use the LARS optimizer (available in PyTorch Lightning Bolts: https://github.com/Lightning-AI/lightning-bolts/blob/master/pl_bolts/optimizers/lars.py)
- Add learning rate warmup (the paper uses 10 epochs)
- Use a base learning rate of 0.2 for a batch size of 256
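
As a rough sketch of the warmup and learning rate points above, the schedule could be computed like this in plain Python. The helper names `scaled_lr` and `lr_at_epoch` are hypothetical and not part of lightly. Scaling the rate linearly with batch size and using cosine decay after warmup are both assumptions made for illustration; the `final_lr` default is arbitrary.

```python
import math


def scaled_lr(batch_size: int, base_lr: float = 0.2, base_batch_size: int = 256) -> float:
    """Scale the base learning rate linearly with batch size (assumed rule)."""
    return base_lr * batch_size / base_batch_size


def lr_at_epoch(
    epoch: int,
    total_epochs: int,
    peak_lr: float,
    warmup_epochs: int = 10,
    final_lr: float = 0.0,  # assumption: decay target chosen for illustration
) -> float:
    """Linear warmup to peak_lr, then cosine decay towards final_lr."""
    if epoch < warmup_epochs:
        # Ramp up linearly so the first epochs use a small learning rate.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    peak = scaled_lr(batch_size=256)
    for epoch in (0, 4, 9, 10, 50, 99):
        print(epoch, round(lr_at_epoch(epoch, total_epochs=100, peak_lr=peak), 4))
```

In a training loop, the value returned for the current epoch would be written to each optimizer parameter group's learning rate before that epoch starts.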

The paper uses these settings:

We also just added a VicRegCollateFunction which uses the same augmentation parameters as the paper. You can either get it from the master branch or after the next release.

Let us know if this helps!

rzamarefat commented 1 year ago

Thanks for your fast reply. I will test these tips and let you know the result.

lightly-ai / lightly

nan loss in VICEReg #1032