juntang-zhuang / Adabelief-Optimizer

Repository for NeurIPS 2020 Spotlight "AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients"
BSD 2-Clause "Simplified" License

fine-tune with bert models #42

Closed JaheimLee closed 3 years ago

JaheimLee commented 3 years ago

Have you ever tested AdaBelief for fine-tuning BERT models? And what are the recommended hyper-parameters?

juntang-zhuang commented 3 years ago

@JaheimLee Hi, I have not tested fine-tuning BERT; it is too large for my hardware. I did test a small transformer at https://github.com/juntang-zhuang/fairseq-adabelief, and the defaults in adabelief-pytorch==0.2.0 seem to work there; eps=1e-16 helps. I'm not so sure about rectify: sometimes it helps and sometimes it doesn't, so it may need some tuning. For the other hyper-parameters, such as lr and betas, the same values as Adam work.

BTW, if you use fp16 to accelerate training, v0.2.0 might be problematic, because eps=1e-16 is rounded to 0 in fp16. A workaround is to run the forward and backward passes in fp16 but update the parameters in fp32; see https://github.com/juntang-zhuang/Adabelief-Optimizer/issues/31#issuecomment-757552023. We are considering adding this to the next release.
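For reference, here is a minimal sketch of the setup described above: AdaBelief with the v0.2.0-style settings (eps=1e-16, rectify left off but worth trying both ways), combined with PyTorch AMP so that the forward/backward passes run in fp16 while the parameters and optimizer update stay in fp32. `MyBertModel` and `loader` are placeholders for your own model and dataloader, and the exact keyword defaults may differ across adabelief-pytorch versions, so treat this as an illustrative sketch rather than the exact fix discussed in issue #31.

```python
import torch
from torch.cuda.amp import autocast, GradScaler
from adabelief_pytorch import AdaBelief

model = MyBertModel().cuda()   # placeholder for your fine-tuning model

optimizer = AdaBelief(
    model.parameters(),
    lr=1e-3,                   # use the same lr/betas that work for Adam in your setup
    betas=(0.9, 0.999),
    eps=1e-16,                 # underflows to 0 in pure fp16, hence the fp32 master weights below
    weight_decouple=True,
    rectify=False,             # mixed results reported; try both
)

scaler = GradScaler()
for inputs, labels in loader:  # placeholder dataloader
    optimizer.zero_grad()
    with autocast():           # forward (and the resulting backward) run in fp16
        loss = model(inputs, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)     # gradients are unscaled and the update is applied to fp32 params
    scaler.update()
```

With AMP the model parameters remain fp32 master weights and only the autocast-eligible ops run in half precision, which matches the "forward/backward in fp16, update in fp32" bypass mentioned above.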

JaheimLee commented 3 years ago

ok, thanks!