rwightman opened this issue 2 years ago
@rwightman Oh hey Ross! Glad to see you are keeping up with the bleeding edge :) I just sent you an email a moment ago about your CLIP experiment
So I did do it the way in the paper initially, but I had to initialize `n` to `grad ** 2`, as was done in the restart condition https://github.com/lucidrains/Adan-pytorch/commit/14ec8b31b90c57df9ce9a9a151ec833c0854e989#diff-61c9ea3d62e9746a1092013f1c4d8804f28e654e6bb00da8cd98a527bedc7139R53, for it not to explode for my task (which is a small GPT)
However, I was chatting with @sdtblck and he told me he zero initted everything, so I tried it, and could not see a difference. So I just left it like that for simplicity
Are you seeing a big difference following the careful init as in the pseudocode?
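For concreteness, here is a minimal sketch of the two `n`-init strategies being compared (a standalone helper with illustrative names, assuming a standard PyTorch optimizer state dict; this is not the repo's actual code):

```python
import torch

def init_adan_state(param: torch.Tensor, grad: torch.Tensor, careful: bool) -> dict:
    # Illustrative state init for an Adan-style optimizer; names are made up here.
    return {
        'm': torch.zeros_like(param),      # EMA of gradients
        'v': torch.zeros_like(param),      # EMA of gradient differences
        'prev_grad': grad.clone(),         # needed for the gradient-difference term
        # "careful" init seeds n with grad ** 2 (as in the restart condition) so the
        # first denominator sqrt(n) + eps is on the right scale; otherwise start at zero
        'n': grad ** 2 if careful else torch.zeros_like(param),
    }
```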
I'm seeing really poor results if `state['n']` is initialized as zeros as you have in your code
Not very rigorous, but: blue is Adam (baseline), red is the careful init with the grad-squared init of `n`, purple is the zero init, and brown is the careful init but without grad squared
@lucidrains I've also been doing some not-very-scientific comparisons (restart training with the same seed and see what happens) in the case of one network (a ViT-CNN hybrid), one random init. But I am seeing what you are so far
Careful init w/ n == 0 is not great. All zeros is better. Now trying careful + n == grad ** 2...
@rwightman awesome! everyone would be eager to hear your results, which are much more authoritative than my toy tasks haha
@lucidrains so, two network archs now, running through the variations, all zeros with no special case init definitely appears to be the winner in these tests of limited scope. Hmm...
Sorry for the confusion here. Adan does indeed have the bias correction in the implementation, but we needed to keep the algorithm presentation consistent with the theoretical analysis. Hence, we did not explicitly show it in Algorithm 1. We'll release the code in a few days (2-3 days, since we have a code review procedure). The logs and config files will be released together. @rwightman
@XingyuXie Hi Xingyu and thanks for the interesting paper
I tested out the bias correction and am indeed seeing a slight improvement https://github.com/lucidrains/Adan-pytorch/commit/3911a86e41624a5048e687e18d451b3fd5007242 Let me know if you see anything else that does not look quite right!
@lucidrains Thanks for the update; the following are some minor modifications. When we implemented Adan, we referred to some of the optimizer implementations in timm.
Line 55: state['prev_grad'] = grad
Line 85-86:
correct_m = 1 / bias_correct1 # correction term for m'
correct_v = 1 / bias_correct2 # correction term for v
Line 91:
weighted_step_size = lr / ((n.sqrt()/sqrt_bias_correct3).add_(eps))
Tips: `weight_decay = 0.02` seems to be suitable for most experiments.

@XingyuXie thanks for the code review!
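Putting those fixes together, here is a hedged sketch of one bias-corrected Adan-style step. The beta convention (betas as small mixing weights) and the variable names follow the discussion above; this is a sketch under those assumptions, not the repo's exact code, and weight decay is omitted:

```python
import math
import torch

def adan_step(param, grad, state, lr=1e-3, betas=(0.02, 0.08, 0.01), eps=1e-8, step=1):
    beta1, beta2, beta3 = betas
    m, v, n = state['m'], state['v'], state['n']

    grad_diff = grad - state['prev_grad']
    update_term = grad + (1 - beta2) * grad_diff

    m.mul_(1 - beta1).add_(grad, alpha=beta1)       # EMA of gradients
    v.mul_(1 - beta2).add_(grad_diff, alpha=beta2)  # EMA of gradient differences
    n.mul_(1 - beta3).addcmul_(update_term, update_term, value=beta3)  # EMA of squared update term

    # Adam-style bias corrections; with mixing-weight betas the per-step decay is (1 - beta)
    bias_correct1 = 1 - (1 - beta1) ** step
    bias_correct2 = 1 - (1 - beta2) ** step
    bias_correct3 = 1 - (1 - beta3) ** step

    correct_m = 1 / bias_correct1                   # correction term for m
    correct_v = 1 / bias_correct2                   # correction term for v
    weighted_step_size = lr / ((n.sqrt() / math.sqrt(bias_correct3)).add_(eps))

    param.sub_(weighted_step_size * (m * correct_m + (1 - beta2) * v * correct_v))
    state['prev_grad'] = grad.clone()
    return param

# tiny smoke test
p, g = torch.zeros(3), torch.randn(3)
st = {'m': torch.zeros(3), 'v': torch.zeros(3), 'n': torch.zeros(3), 'prev_grad': torch.zeros(3)}
adan_step(p, g, st, step=1)
```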
@lucidrains You're welcome.
By increasing the LR and tuning the warmup steps, the performance may be further improved. Have fun using Adan.
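A small example of the kind of setup being suggested: moderate weight decay, a somewhat higher LR than you might use with Adam, and a short linear warmup. The commented-out Adan import/constructor is an assumption about whichever implementation you use; the rest is plain PyTorch:

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(10, 1)                       # stand-in model
# from adan_pytorch import Adan                # swap in any Adan implementation here
# optimizer = Adan(model.parameters(), lr=1e-2, weight_decay=0.02)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=0.02)  # stand-in

warmup_steps = 1000                            # the "warmup steps" to tune per task
scheduler = LambdaLR(optimizer, lr_lambda=lambda s: min(1.0, (s + 1) / warmup_steps))

for _ in range(5000):
    x = torch.randn(32, 10)
    loss = model(x).pow(2).mean()              # placeholder objective
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```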
@XingyuXie I am using the optimizer visualization for verification, and it feels like the Adan algorithm is less robust than other algorithms. My TF implementation is here, along with the visualization
Sorry, I am not quite familiar with TF. I have tried to add you on WeChat to send our adan.py (implemented with PyTorch), but have not received a response yet. @cpuimage
Thanks @cpuimage, I have visualized Adan on the two toy cases. But we must point out that practical performance is more important, since a bunch of optimizers (e.g., AdaBound and Yogi) can handle the two cases yet vary a lot in practical DNN training.
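For anyone who wants to reproduce this kind of check, a minimal sketch of a 2-D toy comparison: record an optimizer's trajectory on a simple test surface (Rosenbrock here) and compare it against Adam. Any torch.optim-compatible Adan implementation can be plugged in; the function names below are just for illustration:

```python
import torch

def rosenbrock(xy):
    x, y = xy
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def run(opt_cls, steps=500, **opt_kwargs):
    xy = torch.tensor([-1.5, 2.0], requires_grad=True)
    opt = opt_cls([xy], **opt_kwargs)
    path = [xy.detach().clone()]
    for _ in range(steps):
        opt.zero_grad()
        rosenbrock(xy).backward()
        opt.step()
        path.append(xy.detach().clone())
    return torch.stack(path)                   # (steps + 1, 2) trajectory to plot

adam_path = run(torch.optim.Adam, lr=1e-2)
# adan_path = run(Adan, lr=1e-2)               # plug in an Adan implementation the same way
```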
I took Adan for a spin today, and it looks promising. I am just training my fav timm backbone `convnext_tiny` with nice results. The only downside I see is that it's slower, quite a bit slower than Adam.
https://wandb.ai/capecape/adan_optimizer/reports/Adan-The-new-optimizer-that-challenges-Adam--VmlldzoyNTQ5NjQ5
There is also a nice implementation for fastai by Benjamin Warner here: https://github.com/warner-benjamin/fastxtend/blob/main/nbs/optimizers.adan.ipynb
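For reference, a sketch of what swapping Adan in for AdamW on a timm backbone looks like. Only timm.create_model is real timm API here; the commented Adan import and its constructor arguments are assumptions about whichever implementation is used:

```python
import timm
import torch
import torch.nn.functional as F

model = timm.create_model('convnext_tiny', pretrained=False, num_classes=10)

# from adan import Adan                        # e.g. the released implementation
# optimizer = Adan(model.parameters(), lr=1e-2, betas=(0.98, 0.92, 0.99), weight_decay=0.02)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.02)  # baseline

x = torch.randn(8, 3, 224, 224)                # dummy batch
y = torch.randint(0, 10, (8,))
loss = F.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```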
Hi @tcapelle, it can be seen from the experimental results you released here that the Acc. of Adan's three trials is 71.8/75.5/74.0, while Adam's three trials give 72.2/71.4/71.5.
This does not seem consistent with the curve drawn in the blog, but it is also possible that I missed some key details.
We really appreciate your detailed experiments and suggestions.
BTW, our code has been released at: https://github.com/sail-sg/Adan.
It also contains the config files and results for ConvNeXt. Feel free to refer to it, and we welcome any feedback.
Allo @lucidrains, I've been fiddling with this optimizer, and it's looking promising so far. I was looking for other interpretations out there for my doubts re: no bias correction... I'm assuming it's deemed unnecessary due to the explicit m0 and v1 init, but wasn't 100% sure it wasn't just left out for clarity.
I noticed you left m0 as zero, and v1 as an interpolation with a zero init... did you experiment with that vs the notes in the paper, Algorithm 1?
The core of my attempt is below (note I flipped the betas to be comparable to Adam/LAMB/etc: .98, .92, .99)
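(As an aside, not the attempt itself: a minimal illustration of what flipping the betas means, assuming the usual EMA forms; the two parameterizations give the same update.)

```python
# mixing-weight convention (small betas, e.g. 0.02) vs Adam-style convention (betas near 1)
beta_mix = 0.02
beta_adam = 1 - beta_mix        # 0.98, the "flipped" value

m, g = 0.0, 1.0                 # previous EMA value and current gradient (scalars for clarity)
m_mix = (1 - beta_mix) * m + beta_mix * g            # mixing-weight form
m_adam = beta_adam * m + (1 - beta_adam) * g         # Adam-style form
assert abs(m_mix - m_adam) < 1e-12                   # identical EMA update
```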