juntang-zhuang / Adabelief-Optimizer

Repository for NeurIPS 2020 Spotlight "AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients"
BSD 2-Clause "Simplified" License

Fine-tuning EfficientNet-B4 with the AdaBelief optimizer gives worse accuracy than Adam? #38

Closed daixiangzi closed 3 years ago

daixiangzi commented 3 years ago

AdaBelief params: optimizer = AdaBelief(model.parameters(), 0.001, betas=(0.9, 0.999), weight_decay=5e-4, eps=1e-8)

daixiangzi commented 3 years ago

Adam params: optimizer = optim.Adam([{'params': base_params, 'weight_decay': 4e-5}, {'params': fc_params, 'weight_decay': 4e-4}], lr=0.0003, betas=(0.9, 0.99))

daixiangzi commented 3 years ago

Could it be that I use a different weight_decay for different layers? Maybe that is the problem; I will try it later.

juntang-zhuang commented 3 years ago

If Adam is better than SGD, then for AdaBelief you should use a small eps such as 1e-16. If SGD is better than Adam, then the recommended eps is 1e-8. Note that in AdaBelief eps is used as sqrt(v + eps), while in Adam it is sqrt(v) + eps. You need eps_adabelief = eps_adam^2 for AdaBelief to be as adaptive as Adam.
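
A minimal sketch of where eps enters each denominator (illustrative only, not the library code; v stands for the second-moment estimate):

```python
import torch

# v: exponential moving average of squared gradients (Adam) or of the
# squared deviation of the gradient from its mean (AdaBelief).
v = torch.tensor([1e-12, 1e-6, 1e-2])

eps_adam = 1e-8
eps_adabelief = eps_adam ** 2                 # 1e-16

denom_adam      = v.sqrt() + eps_adam         # Adam:      sqrt(v) + eps
denom_adabelief = (v + eps_adabelief).sqrt()  # AdaBelief: sqrt(v + eps)

# With eps_adabelief = eps_adam ** 2 the two denominators are of comparable
# scale, so the two optimizers are comparably adaptive.
print(denom_adam, denom_adabelief)
```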

juntang-zhuang commented 3 years ago

Please use 0.2.0 and read the hyper-parameter recommendations.

juntang-zhuang commented 3 years ago

BTW, you should also check whether decoupled weight decay is used, and whether the decoupled weight decay matches what you use in Adam. The weight decay behaves very differently for Adam and AdamW, and similarly for AdaBelief with and without weight_decouple.
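
To make the difference concrete, here is a conceptual sketch of the two decay styles (hypothetical helper functions, not the optimizer code):

```python
def coupled_decay(grad, param, weight_decay):
    # Adam-style (weight_decouple=False): the decay term is folded into the
    # gradient, so it is later rescaled by the adaptive denominator.
    return grad + weight_decay * param

def decoupled_decay(param, lr, weight_decay):
    # AdamW-style (weight_decouple=True): the weight shrinks directly by a
    # factor (1 - lr * weight_decay), independent of the adaptive denominator.
    return param * (1 - lr * weight_decay)
```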

daixiangzi commented 3 years ago

> BTW, you should also check whether decoupled weight decay is used, and whether the decoupled weight decay matches what you use in Adam. The weight decay behaves very differently for Adam and AdamW, and similarly for AdaBelief with and without weight_decouple.

I checked: weight_decouple=False, fixed_decay=False, rectify=False, amsgrad=False when using AdaBelief.

daixiangzi commented 3 years ago

I guess the result should be similar to Adam when optimizer = AdaBelief([{'params': base_params, 'weight_decay': 4e-5}, {'params': fc_params, 'weight_decay': 4e-4}], lr=0.0003, betas=(0.9, 0.99), weight_decouple=False, fixed_decay=False, rectify=False, amsgrad=False)?

juntang-zhuang commented 3 years ago

Depending on the version you are using: if the default eps is 1e-16 (version >= 0.1.0), then AdaBelief should behave similarly to Adam.

If Adam outperforms SGD, my experience is that rectify often helps, and decoupled weight decay also helps. The weight_decay parameter might need some tuning if you turn on decoupled decay, just as weight_decay needs tuning when switching from Adam to AdamW.
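
For example, a configuration along these lines (a sketch only, assuming the adabelief_pytorch package; the values are illustrative rather than tuned for this task, and model is taken from the earlier snippets):

```python
from adabelief_pytorch import AdaBelief

optimizer = AdaBelief(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),
    eps=1e-16,             # sqrt(v + eps) form, roughly (1e-8)**2 of Adam's eps
    weight_decay=1e-2,     # decoupled decay usually wants a larger value than coupled
    weight_decouple=True,  # AdamW-style decoupled decay
    rectify=True,          # RAdam-style rectification
)
```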

daixiangzi commented 3 years ago

OK, thank you.

daixiangzi commented 3 years ago

I compared

AdaBelief([{'params': base_params, 'weight_decay': 4e-5}, {'params': fc_params, 'weight_decay': 4e-4}], lr=0.0003, betas=(0.9, 0.99), eps=1e-16, weight_decouple=True, fixed_decay=False, rectify=False, amsgrad=False)

vs.

optim.AdamW([{'params': base_params, 'weight_decay': 4e-5}, {'params': fc_params, 'weight_decay': 4e-4}], lr=0.0003, betas=(0.9, 0.99))

and the accuracy of the latter seems better. What is the reason? The version is v0.2.0.

daixiangzi commented 3 years ago

The default eps for optim.AdamW is 1e-8, but the effect of a smaller eps seems small.

juntang-zhuang commented 3 years ago

What is the performance difference? Is it statistically significant? Did you tune the hyper-params, or just pick values that work for AdamW and set the same for AdaBelief?

juntang-zhuang commented 3 years ago

BTW, how does SGD compare to Adam in your task? What is the dataset and task, and how many training epochs?

daixiangzi commented 3 years ago

> BTW, how does SGD compare to Adam in your task? What is the dataset and task, and how many training epochs?

I don't use SGD, but I use Adam and AdamW. I am now training the model with AdaBelief, but the result is not out yet. At present AdamW >> Adam.

daixiangzi commented 3 years ago

> What is the performance difference? Is it statistically significant? Did you tune the hyper-params, or just pick values that work for AdamW and set the same for AdaBelief?

Val accuracy. Maybe it is because I have only seen one epoch; after waiting a few more epochs, the AdaBelief performance could be better.

daixiangzi commented 3 years ago

Here are my results after 15 epochs.

AdamW:
5601/0, val_acc: 0.917621, lr: 0.000300
5601/1, val_acc: 0.925099, lr: 0.000299
5601/2, val_acc: 0.949343, lr: 0.000297
5601/3, val_acc: 0.930527, lr: 0.000295
5601/4, val_acc: 0.955735, lr: 0.000293
5601/5, val_acc: 0.963093, lr: 0.000290
5601/6, val_acc: 0.919793, lr: 0.000286
5601/7, val_acc: 0.961283, lr: 0.000282
5601/8, val_acc: 0.936316, lr: 0.000277
5601/9, val_acc: 0.953323, lr: 0.000272
5601/10, val_acc: 0.958871, lr: 0.000267
5601/11, val_acc: 0.961766, lr: 0.000261
5601/12, val_acc: 0.965384, lr: 0.000254
5601/13, val_acc: 0.951031, lr: 0.000247
5601/14, val_acc: 0.967676, lr: 0.000240

AdaBelief:
5601/0, val_acc: 0.911108, lr: 0.000300
5601/1, val_acc: 0.938246, lr: 0.000299
5601/2, val_acc: 0.943794, lr: 0.000297
5601/3, val_acc: 0.918466, lr: 0.000295
5601/4, val_acc: 0.944398, lr: 0.000293
5601/5, val_acc: 0.934266, lr: 0.000290
5601/6, val_acc: 0.939935, lr: 0.000286
5601/7, val_acc: 0.945001, lr: 0.000282
5601/8, val_acc: 0.944277, lr: 0.000277
5601/9, val_acc: 0.956459, lr: 0.000272
5601/10, val_acc: 0.942468, lr: 0.000267
5601/11, val_acc: 0.961525, lr: 0.000261
5601/12, val_acc: 0.946086, lr: 0.000254
5601/13, val_acc: 0.947533, lr: 0.000247
5601/14, val_acc: 0.948981, lr: 0.000240

juntang-zhuang commented 3 years ago

Thanks for the info, here are a few tips that might help.

(1) Turn on rectify. It seems that in your case a small eps of 1e-16 is better than 1e-8, so I guess the ada-family would outperform the sgd-family; in that case, rectify often helps.

(2) Try different learning rates. Typically AdaBelief takes a larger stepsize than Adam, so AdaBelief is usually better when training from scratch. But that stepsize might be too large for fine-tuning, and after 15 epochs the learning rate is still close to the initial lr. A smaller lr could help if your goal is to reach a high score within a short period, e.g. 20 epochs. If you can train for longer, say 200 epochs or more, so that the lr decays to a quite small value, then you should wait for the final result, because the initial lr is much larger than the final lr and the optimizers could behave very differently.

(3) Did you use gradient clipping or other tricks during training? AdaBelief is inherently incompatible with gradient clipping.

(4) Tune other hyper-parameters, especially the beta values; AdaBelief could have a different optimal beta than AdamW.

(5) I'm not so sure about this point, but I suspect the weight_decay is too small in your case. Remember that with decoupled decay the weight is multiplied by (1 - lr x weight_decay); with lr=3e-4 and weight_decay=4e-5, lr x weight_decay is only 1.2e-8. The default weight_decay for AdamW is 1e-2, much larger than typical values for SGD such as 1e-4. Without decoupled decay, weight_value x weight_decay is added to the grad, and that value is typically larger than 1.2e-8. See the sketch below.
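
A quick numeric sketch of point (5), using the values from this thread:

```python
lr, weight_decay = 3e-4, 4e-5

# Decoupled (AdamW-style) decay multiplies the weight by (1 - lr * weight_decay),
# so the per-step shrinkage is lr * weight_decay:
print(lr * weight_decay)   # ~1.2e-08 -> essentially no regularization

# With AdamW's default weight_decay = 1e-2 the shrinkage is much larger:
print(lr * 1e-2)           # ~3e-06
```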

daixiangzi commented 3 years ago

> But that stepsize might be too large for fine-tuning, and after 15 epochs the learning rate is still close to the initial lr.

I don't use grad clip ops. I trained for 50 epochs; the result is still AdamW > AdaBelief.

daixiangzi commented 3 years ago

> (5) ... Remember that with decoupled decay the weight is multiplied by (1 - lr x weight_decay) ...

Is the weight multiplied by (1 - lr x weight_decay), or by lr x weight_decay?

juntang-zhuang commented 3 years ago

The weight is multiplied by (1 - lr x weight_decay) if you use decoupled weight decay, which is the same for RAdam and AdamW; see the code in the official PyTorch AdamW: https://github.com/pytorch/pytorch/blob/e44b2b72bd4ccecf9c2f6c18d09c11eff446b5a3/torch/optim/adamw.py#L73. If (lr x weight_decay) is too small, say 1e-8, then there is basically no weight decay regularization.

juntang-zhuang commented 3 years ago

> But that stepsize might be too large for fine-tuning, and after 15 epochs the learning rate is still close to the initial lr.

> I don't use grad clip ops. I trained for 50 epochs; the result is still AdamW > AdaBelief.

Sorry, I did not notice this message. It is quite hard to draw a conclusion now. It is not only affected by the number of epochs: AdaBelief inherently takes a larger step than Adam under the same lr in most cases. If you want me to look into the details, you could share the code if possible.

daixiangzi commented 3 years ago

> AdaBelief inherently takes a larger step than Adam under the same lr in most cases.

Does that mean I should use a smaller learning rate?

juntang-zhuang commented 3 years ago

I think so, if your goal is fine-tuning on the same task as the pretraining. But if you care about fine-tuning on a different task, it is hard to say whether a smaller lr is better.

daixiangzi commented 3 years ago

Er... OK.

juntang-zhuang commented 3 years ago

@daixiangzi I'm closing this issue now. Please let me know if you can provide more info, such as the code and task; otherwise it's hard to give useful suggestions.