sail-sg / Adan

Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
Apache License 2.0

About the convergence trend comparison with AdamW in ViT-H #16

Open haihai-00 opened 2 years ago

haihai-00 commented 2 years ago

Hi, thank you very much for your brilliant work on Adan! According to Figure 1 of your paper, Adan should reach a lower loss (both train and test) than AdamW. However, I got a higher training loss with Adan than with AdamW on ViT-H:

| Steps | AdamW train loss | Adan train loss |
| --- | --- | --- |
| 200 | 6.9077 | 6.9077 |
| 400 | 6.9074 | 6.9075 |
| 600 | 6.9068 | 6.9073 |
| 800 | 6.9061 | 6.9070 |
| 1000 | 6.9050 | 6.9064 |
| 1200 | 6.9036 | 6.9056 |
| 1400 | 6.9014 | 6.9044 |
| 1600 | 6.8990 | 6.9028 |
| 1800 | 6.8953 | 6.9003 |
| 2000 | 6.8911 | 6.8971 |
| 2200 | 6.8848 | 6.8929 |
| 2400 | 6.8789 | 6.8893 |
| 2600 | 6.8699 | 6.8843 |
| 2800 | 6.8626 | 6.8805 |
| 3000 | 6.8528 | 6.8744 |
| 3200 | 6.8402 | 6.8680 |
| 3400 | 6.8293 | 6.8620 |
| 3600 | 6.8172 | 6.8547 |
| 3800 | 6.7989 | 6.8465 |
| 4000 | 6.7913 | 6.8405 |

I used the same HPs as AdamW and only changed the betas from (0.9, 0.999) to (0.9, 0.92, 0.999) (see the sketch below). I only trained for a few steps to see the trend, but the loss gap from AdamW already seems quite large. Should I change other HPs to make better use of Adan? How can I get a lower loss than with AdamW? I noticed that Adan prefers a large batch size on vision tasks; should we use a larger batch size? Or should I train for more steps to see the trend? Thank you!
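For concreteness, the swap described above might look like the following. This is a minimal sketch, not the exact training script: it assumes the Adan class from this repo's adan.py, and the model, lr, and weight-decay values are placeholders.

```python
import torch
from adan import Adan  # the optimizer shipped in this repo (adan.py)

model = torch.nn.Linear(768, 1000)  # placeholder stand-in for ViT-H

# Baseline: AdamW with betas = (0.9, 0.999); lr/wd values are placeholders
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3,
                          betas=(0.9, 0.999), weight_decay=0.05)

# The change tried here: identical HPs, only the betas extended
# to Adan's three-element tuple (0.9, 0.92, 0.999)
adan = Adan(model.parameters(), lr=1e-3,
            betas=(0.9, 0.92, 0.999), weight_decay=0.05)
```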

XingyuXie commented 2 years ago

@haihai-00 Hi, I suggest referring to the HPs we use for ViT-B and ViT-S. At a minimum, you may try the default betas (0.98, 0.92, 0.99) and set wd to 0.02. To help make more progress on your task, I will try to train ViT-H myself, so please share more details about your HP setting.
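That suggestion would amount to something like the sketch below (again assuming the Adan class from this repo's adan.py; model and lr are placeholders, so keep your own schedule):

```python
import torch
from adan import Adan

model = torch.nn.Linear(768, 1000)  # placeholder stand-in for ViT-H

# Suggested starting point: the default betas (0.98, 0.92, 0.99)
# together with weight decay 0.02
optimizer = Adan(model.parameters(), lr=1e-3,
                 betas=(0.98, 0.92, 0.99),
                 weight_decay=0.02)
```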

haihai-00 commented 2 years ago

Thank you! We are using the HPs given in this paper: https://arxiv.org/pdf/2208.06366.pdf (Appendix F, "Hyperparameters for Image Classification", fine-tuning for ViT-L/16). Those HPs were provided for ViT-L, but we use them because we could not find officially released HPs for ViT-H.

XingyuXie commented 2 years ago

It seems that you are fine-tuning the model rather than training from scratch, right? Actually, we have already provided results for fine-tuning MAE-ViT-Large here.

Moreover, I also fine-tuned MAE-ViT-H from its official pre-trained checkpoint (reaching 86.9% after 50 epochs). You may add my WeChat: xyxie_joy, and I can send you the log file.