optimizer = optim.Adam([{'params': base_params, 'weight_decay': 4e-5}, {'params': fc_params, 'weight_decay': 4e-4}], lr=0.0003, betas=(0.9, 0.99))
Could it be that different layers use different weight_decay values? Maybe that is the problem; I will try it later.
If Adam is better than SGD, then in AdaBelief you should use a small eps such as 1e-16.
If SGD is better than Adam, then the recommended eps is 1e-8.
Note that in AdaBelief eps is used inside the square root, as sqrt(v + eps), while in Adam it is sqrt(v) + eps. You need eps_adabelief = eps_adam^2 for AdaBelief to be about as adaptive as Adam.
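To make the relation concrete, here is a tiny sketch (illustration only, not the library internals) comparing the two denominators for a hypothetical second-moment value:

```python
import torch

# Illustration only, not the optimizer source: compare the adaptive
# denominators for a hypothetical second-moment estimate v.
v = torch.tensor(1e-12)
eps_adam = 1e-8
eps_adabelief = eps_adam ** 2                     # 1e-16, i.e. eps_adam squared

denom_adam = v.sqrt() + eps_adam                  # Adam:      sqrt(v) + eps
denom_adabelief = (v + eps_adabelief).sqrt()      # AdaBelief: sqrt(v + eps)
print(denom_adam.item(), denom_adabelief.item())  # comparable magnitudes
```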
Please use version 0.2.0 and read the hyperparameter recommendations.
BTW, you should also check whether decoupled weight decay is used, and whether it matches what you use in Adam. Weight decay behaves very differently in Adam vs. AdamW, and similarly for AdaBelief with and without weight_decouple.
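Roughly, the two decay styles differ as in the sketch below (not the optimizer source; lr and weight_decay are taken from the snippet at the top of the thread):

```python
import torch

# Rough sketch of the two weight-decay styles, illustration only.
lr, weight_decay = 3e-4, 4e-5
param = torch.randn(10)
grad = torch.randn(10)

# Coupled (classic Adam-style L2 penalty): decay is folded into the
# gradient, so it is later rescaled by the adaptive denominator.
grad_coupled = grad + weight_decay * param

# Decoupled (AdamW-style; weight_decouple=True in AdaBelief): decay acts
# on the weights directly, independent of the adaptive scaling.
param_decayed = param * (1 - lr * weight_decay)
```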
I checked: weight_decouple=False, fixed_decay=False, rectify=False, amsgrad=False when using AdaBelief.
I guess the result should be similar to Adam with optimizer = optim.AdaBelief([{'params': base_params, 'weight_decay': 4e-5}, {'params': fc_params, 'weight_decay': 4e-4}], lr=0.0003, betas=(0.9, 0.99), weight_decouple=False, fixed_decay=False, rectify=False, amsgrad=False)?
Depending on the version you are using: if the default eps=1e-16 (version >= 0.1.0), then AdaBelief should behave similarly to Adam.
If Adam outperforms SGD, my experience is that rectify often helps, and decoupled weight decay also helps. The weight_decay parameter might need some tuning if you turn on decoupled decay, just as weight_decay needs retuning when switching from Adam to AdamW.
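For reference, a minimal sketch of such a configuration, assuming the adabelief_pytorch package; the model and param groups are stand-ins mimicking the Adam snippet at the top of the thread:

```python
import torch
from adabelief_pytorch import AdaBelief

# Sketch only: small eps, decoupled decay, and rectify turned on.
# The stand-in model and param groups mimic the earlier Adam call;
# weight_decay values may well need retuning once decoupling is on.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))
base_params = model[0].parameters()
fc_params = model[1].parameters()

optimizer = AdaBelief(
    [{'params': base_params, 'weight_decay': 4e-5},
     {'params': fc_params, 'weight_decay': 4e-4}],
    lr=3e-4,
    betas=(0.9, 0.99),
    eps=1e-16,
    weight_decouple=True,   # AdamW-style decoupled decay
    rectify=True,           # RAdam-style rectification
)
```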
OK, thank you.
optim.AdaBelief([{'params': base_params, 'weight_decay': 4e-5}, {'params': fc_params, 'weight_decay': 4e-4}], lr=0.0003, betas=(0.9, 0.99), eps=1e-16, weight_decouple=True, fixed_decay=False, rectify=False, amsgrad=False) vs. optim.AdamW([{'params': base_params, 'weight_decay': 4e-5}, {'params': fc_params, 'weight_decay': 4e-4}], lr=0.0003, betas=(0.9, 0.99)): in terms of accuracy, the latter seems better. What is the reason? The version is v0.2.0.
The default eps for optim.AdamW is 1e-8, but the effect of eps seems small.
What is the performance difference? Is it statistically significant? Did you tune the hyperparameters, or just pick values that work for AdamW and apply the same ones to AdaBelief?
BTW, how does SGD compare to Adam on your task? What are the dataset, the task, and the number of training epochs?
I don't use SGD, but I use Adam and AdamW. Now I am training the model with AdaBelief, but the results are not out yet. At present AdamW >> Adam.
What is the performance difference? Is it statistically significant? Did you tune the hyperparameters, or just pick values that work for AdamW and apply the same ones to AdaBelief?
Val accuracy; maybe I only looked at one epoch. After waiting a few more epochs, AdaBelief's performance could be better.
Here are my results after 15 epochs (5601 iterations per epoch; val_acc and lr per epoch):

epoch   lr        AdamW     AdaBelief
0       0.000300  0.917621  0.911108
1       0.000299  0.925099  0.938246
2       0.000297  0.949343  0.943794
3       0.000295  0.930527  0.918466
4       0.000293  0.955735  0.944398
5       0.000290  0.963093  0.934266
6       0.000286  0.919793  0.939935
7       0.000282  0.961283  0.945001
8       0.000277  0.936316  0.944277
9       0.000272  0.953323  0.956459
10      0.000267  0.958871  0.942468
11      0.000261  0.961766  0.961525
12      0.000254  0.965384  0.946086
13      0.000247  0.951031  0.947533
14      0.000240  0.967676  0.948981
Thanks for the info, here are a few tips that might help.
(1) Turn on rectify. It seems that in your case a small eps of 1e-16 is better than 1e-8, so I would guess the Ada family outperforms the SGD family here; in that case, rectify often helps.
(2) Try different learning rates. Typically AdaBelief takes a larger stepsize than Adam, so AdaBelief is usually better when training from scratch, but the stepsize might be too large for finetuning; also, after 15 epochs the learning rate has barely decayed from the initial lr. A smaller lr might help if your goal is a high score within a short period, e.g. 20 epochs. If you can train for longer, say 200 epochs or more, so that the lr decays to a quite small value, then you should wait for the final result, because the initial lr is much larger than the final lr and optimizers can behave very differently.
(3) Did you use gradient clipping or other tricks during training? AdaBelief is inherently incompatible with gradient clipping.
(4) Tune other hyperparameters, especially the beta values; AdaBelief could have a different optimal beta than AdamW.
(5) I'm not so sure about this point, but I suspect the weight_decay is too small in your case. Remember that with decoupled decay the weight is multiplied by (1 - lr x weight_decay); with lr=3e-4 and weight_decay=4e-5, the product lr x weight_decay is 1.2e-8. The default weight_decay for AdamW is 1e-2, much larger than typical SGD values such as 1e-4. Without decoupled decay, weight_value x weight_decay is added to the gradient, which is typically larger than 1.2e-8. A sketch of sweeping the knobs from (4) and (5) follows below.
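As mentioned in (5), here is a hypothetical sketch of sweeping those knobs; the candidate values are illustrative only and the model is a stand-in:

```python
from itertools import product

import torch
from adabelief_pytorch import AdaBelief

# Hypothetical sweep over beta2 and (decoupled) weight_decay; the values
# below are illustrative, not recommendations.
model = torch.nn.Linear(16, 2)   # stand-in for the finetuned network
for beta2, wd in product([0.99, 0.999], [4e-5, 1e-2]):
    optimizer = AdaBelief(
        model.parameters(),
        lr=3e-4,
        betas=(0.9, beta2),
        eps=1e-16,
        weight_decay=wd,
        weight_decouple=True,
        rectify=True,
    )
    # ... train for a few epochs with this optimizer and compare val_acc ...
```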
But the stepsize might be too large for finetuning; also, after 15 epochs the learning rate has barely decayed from the initial lr.
I don't use gradient clipping. I trained for 50 epochs and the result is still AdamW > AdaBelief.
Is the weight multiplied by (1 - lr x weight_decay), or by lr x weight_decay?
The weight is multiplied by (1 - lr x weight_decay) if you use decoupled weight decay, which is the same for RAdam and AdamW; see the code in the official PyTorch AdamW: https://github.com/pytorch/pytorch/blob/e44b2b72bd4ccecf9c2f6c18d09c11eff446b5a3/torch/optim/adamw.py#L73. If lr x weight_decay is too small, say 1e-8, then there is basically no weight decay regularization.
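A quick sanity check of the arithmetic (illustration only):

```python
# Decoupled decay multiplies the weight by (1 - lr * weight_decay).
lr, weight_decay = 3e-4, 4e-5
print(1 - lr * weight_decay)   # 0.999999988 -> essentially no decay per step

# For comparison, with the PyTorch AdamW default weight_decay of 1e-2:
print(1 - lr * 1e-2)           # 0.999997 -> a noticeably stronger decay per step
```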
Sorry, I did not notice this message. It's quite hard to draw a conclusion now. It's not only affected by the number of epochs; AdaBelief inherently takes a larger step than Adam under the same lr in most cases. If you want me to look at the details, you could share the code if possible.
"AdaBelief inherently takes a larger step than Adam under the same lr in most cases": does that mean I should use a smaller learning rate?
I think so, if your goal is to finetune on the same task as the pretraining. But if you care about finetuning on a different task, it's hard to say whether a smaller lr is better.
OK.
@daixiangzi I'm closing this issue now. Please let me know if you can provide more info, such as the code and task; otherwise it's hard to give useful suggestions.
AdaBelief params: optimizer = AdaBelief(model.parameters(), 0.001, betas=(0.9, 0.999), weight_decay=5e-4, eps=1e-8)