lucidrains / lion-pytorch

🦁 Lion, a new optimizer discovered by Google Brain using genetic algorithms that is purportedly better than Adam(w), in PyTorch
MIT License

Always getting NaNs in long training #33

Open danbochman opened 10 months ago

danbochman commented 10 months ago

I've been experimenting with the LION optimizer in your other (great) Imagen repository. I can share my anecdotal experience with these combinations:

  • Models of different sizes: 0.2B, 0.7B and 1B params.
  • Betas such as beta1 0.95 and beta2 0.98.
  • Learning rates of 1e-4, 3e-5 and 1e-5.
  • The Triton kernel turned both True and False.

Training was indeed fast, but unfortunately it always ended up yielding NaNs in the end.

I think a potential issue could be how LION interacts with a warmup schedule; I am not sure if you're supposed to do warmup with this optimizer or not (which I always did).
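
For concreteness, a minimal sketch of how such a run could be wired up with this repository's `Lion` class, using one of the combinations listed above (the toy model is a placeholder for the actual Imagen U-Nets, and the weight decay value is the readme's example rather than from the report):

```python
import torch
from lion_pytorch import Lion

# toy stand-in for the actual Imagen model
model = torch.nn.Linear(512, 512)

# one of the combinations reported above: lr 1e-4, betas (0.95, 0.98)
optimizer = Lion(
    model.parameters(),
    lr = 1e-4,
    betas = (0.95, 0.98),
    weight_decay = 1e-2,
    use_triton = False  # toggled both True and False in the experiments above
)
```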


ysesst93013 commented 9 months ago

I have the same problem :(

SergeySakharovskiy commented 8 months ago

Same NaN issue with a CosineAnnealing scheduler after the first epoch.
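
For what it's worth, `Lion` subclasses `torch.optim.Optimizer`, so a cosine schedule attaches the same way it would to AdamW; a minimal sketch (the toy model and `T_max` are placeholders, not the poster's actual setup):

```python
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(64, 64)  # toy stand-in
optimizer = Lion(model.parameters(), lr = 1e-4, weight_decay = 1e-2)

# stock PyTorch cosine annealing, stepped once per epoch
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max = 100)

for epoch in range(100):
    # ... forward / backward / optimizer.step() over the batches goes here ...
    scheduler.step()
```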

xiangning-chen commented 8 months ago

(quoting danbochman's report above)

May I know the learning rate schedule you are using?
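
For reference, a linear-warmup schedule of the kind mentioned above could be expressed as in the sketch below. This is only illustrative: the warmup length is hypothetical, and the readme of this repository suggests a 3-10x smaller learning rate than AdamW (with correspondingly larger weight decay), which is roughly the range reported in this thread.

```python
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(64, 64)  # toy stand-in
optimizer = Lion(model.parameters(), lr = 1e-4, weight_decay = 1e-2)

warmup_steps = 1000  # hypothetical warmup length

# linear warmup from ~0 to the base lr, then constant; stepped once per batch
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda = lambda step: min(1.0, (step + 1) / warmup_steps)
)
```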

zjutzyl commented 7 months ago

Same issue; I set a large weight decay to avoid it. I suspect that 'update = sign(...) * lr' keeps enlarging abs(parameter) while the sign does not change.
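
To make that argument concrete, here is a toy numerical sketch (not the library's actual kernel): Lion's step is roughly `p <- p * (1 - lr * wd) - lr * sign(update)`, so if the sign never flips, `|p|` drifts by `lr` every step unless the decoupled weight decay term pushes back.

```python
lr, wd = 1e-4, 0.5
p_no_wd = p_wd = 1.0

for _ in range(100_000):
    s = -1.0                                  # suppose the update sign is stuck
    p_no_wd = p_no_wd - lr * s                # grows by lr each step, without bound
    p_wd    = p_wd * (1 - lr * wd) - lr * s   # decoupled weight decay pushes back

print(p_no_wd)  # ≈ 11.0 -> keeps growing linearly with the number of steps
print(p_wd)     # ≈ 1.99 -> settles toward 1 / wd = 2.0, where decay balances the step
```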

lindakasabian commented 6 months ago

Same here. Sudden NaN losses during 100-epoch training with OneCycleLR and gradient clipping.
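
For anyone trying to reproduce this, the usual OneCycleLR plus gradient-clipping loop would look roughly like the sketch below (toy model, step count, and max_norm are placeholders, not the poster's actual configuration):

```python
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(64, 64)  # toy stand-in
optimizer = Lion(model.parameters(), lr = 1e-4, weight_decay = 1e-2)

total_steps = 10_000  # placeholder; OneCycleLR is stepped once per batch
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr = 1e-4, total_steps = total_steps
)

for _ in range(total_steps):
    x = torch.randn(8, 64)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm = 1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()
```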