sail-sg / Adan

Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
Apache License 2.0
744 stars 63 forks

Some questions about learning rate. #29

Closed stella-von closed 1 year ago

stella-von commented 1 year ago

Thank you for your brilliant work.

I want to ask some questions about Adan's learning rate.

Does Adan use learning rate decay in the paper? Is the Adan optimizer sensitive to the initial learning rate? And how should the learning rate be set compared with Adam under the same task conditions?

Thank you!

XingyuXie commented 1 year ago

@zxc0074869 Hi,

A1: We use the same settings as the previous SoTAs. For example, on ViT we utilize the cosine LR decay as AdamW and LAMB did, while for the dream diffusion experiments we do not decay the LR, since the original setting uses a constant LR during training. For more details, we have released all the config files; you are welcome to check them out.

A2: Not very sensitive to the peak LR, but Adan prefers a larger peak LR since we have more terms in the denominator. Hence, you may also increase the warm-up steps when using a larger peak LR.

A3: As a first try, I would set Adan's LR to 5x the Adam LR and choose a weight decay ten times smaller than Adam's. For stability, I would also double the warm-up steps in this case.
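
For concreteness, here is a minimal sketch of that rule of thumb in PyTorch. It assumes the `Adan` class shipped in this repo (`adan.py`) and a hypothetical AdamW baseline config; the exact constructor arguments and defaults should be checked against the released code.

```python
import torch
from adan import Adan  # Adan optimizer class from this repo (adan.py)

# Hypothetical AdamW baseline settings, used only for illustration.
adam_lr = 1e-3
adam_weight_decay = 5e-2
adam_warmup_steps = 5_000

model = torch.nn.Linear(128, 10)  # placeholder model

# Rule of thumb from the answer above:
#   peak LR       ~ 5x the Adam LR
#   weight decay  ~ 10x smaller than Adam's
#   warm-up steps ~ 2x longer, for stability at the larger peak LR
optimizer = Adan(
    model.parameters(),
    lr=5 * adam_lr,                       # 5e-3
    weight_decay=adam_weight_decay / 10,  # 5e-3
    betas=(0.98, 0.92, 0.99),             # repo defaults; beta3 is tunable (see below)
)
warmup_steps = 2 * adam_warmup_steps      # feed this to whatever LR scheduler you use
```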

stella-von commented 1 year ago

@XingyuXie Thank you very much for your suggestions! The Adan optimizer is excellent and has achieved better results than Adam in our image downstream task.

XingyuXie commented 1 year ago

You are welcome. If Adan does not give good results, please feel free to post your setting here, and we can work together to find the proper settings. We have already helped several users obtain new SoTAs with Adan in their settings, so you are welcome to discuss it with us.

BTW, you may try tuning beta3 for Adan. Typical choices are [0.9, 0.95, 0.999].
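
As an illustration, a beta3 sweep could look like the sketch below; the first two betas are left at the repo defaults, and the values are starting points rather than official recommendations.

```python
import torch
from adan import Adan  # Adan optimizer class from this repo

model = torch.nn.Linear(128, 10)  # placeholder model

# Run a short training job for each candidate beta3 and keep the best one.
for beta3 in (0.9, 0.95, 0.999):
    optimizer = Adan(model.parameters(), lr=5e-3, betas=(0.98, 0.92, beta3))
    # ... train for a few epochs with `optimizer` and record the validation metric ...
```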

stella-von commented 1 year ago

Another question: have you tried combining EMA (exponential moving average) with Adan?

In my task (image super-resolution), we need to use EMA=0.999 to get better results. When using the Adam optimizer, it is better to set beta2 to 0.99 instead of the common 0.999.

EMA=0.9999 is used in MAE. Does the beta3 setting of [0.9, 0.95, 0.999] also achieve better results in MAE training?

XingyuXie commented 1 year ago

EMA can slightly benefit the final results when training ViT-B from scratch. The effect may also depend on the framework: we found that in timm, EMA has little effect on the Acc., but in MMCls the result is sensitive to EMA.
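
For reference, here is a minimal sketch of maintaining a parameter EMA alongside Adan, using `torch.optim.swa_utils.AveragedModel` with a custom averaging function. The decay of 0.999 is the value mentioned above for super-resolution; the model and training loop are placeholders.

```python
import torch
from torch.optim.swa_utils import AveragedModel
from adan import Adan  # Adan optimizer class from this repo

model = torch.nn.Linear(128, 10)  # placeholder model
optimizer = Adan(model.parameters(), lr=5e-3)

decay = 0.999  # EMA decay discussed above
ema_model = AveragedModel(
    model,
    avg_fn=lambda ema_p, p, num_averaged: decay * ema_p + (1.0 - decay) * p,
)

for step in range(1000):  # placeholder training loop
    x, y = torch.randn(32, 128), torch.randn(32, 10)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_model.update_parameters(model)  # refresh the EMA copy after each optimizer step

# Evaluate with `ema_model` (the EMA weights) instead of `model`.
```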

For MAE, the official implementation changed beta2 from 0.999 to 0.95; a trick for Adam or AdamW is to use a small beta2 for large-model training, especially for pre-training. For Adan, beta3 plays the same role as beta2 in Adam. We chose beta3=0.9 for MAE, but recently I found that 0.99 is also okay when the number of training epochs is small. Moreover, when the number of training epochs is small, you may try a larger min_lr and a smaller weight decay.

The official MAE does not seem to use EMA.

stella-von commented 1 year ago

I realized I was wrong. In the appendix of the MAE paper, supervised training of ViT-L/H from scratch used EMA=0.9999 to improve performance. Thank you very much!

XingyuXie commented 1 year ago

Thanks for pointing this out. Their setting also improves the results for ViT-B. We have tried EMA with 0.9999 in the MMCls framework with Adan, and the result is 82.7+% with EMA and around 82.6% without EMA. EMA's benefit may shrink as the Acc. increases, since EMA improved the Acc. from 82.1% to 82.3% in the MAE appendix setting.