s-nlp / kbqa


Try Amos optimizer for seq2seq model. Compare with AdamW #72

Closed: MihailSalnikov closed this issue 1 year ago

MihailSalnikov commented 2 years ago

- Try for t5-large
- Evaluate results and speed on WDSQ and Mintaka

Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale

Amos is a new optimizer that we propose to pre-train large language models. It is more efficient and converges faster than AdamW: ≤ 51% memory for slot variables, and better valid loss within ≤ 70% training time!

arXiv: https://arxiv.org/abs/2210.11693
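
Since the comparison is essentially an optimizer swap, here is a minimal sketch of how it could be wired up, under stated assumptions: the official Amos implementation is JAX/optax (google-research/jestimator), so the torch.optim-style `Amos` class and the `amos_pytorch` package name below are hypothetical placeholders for whichever PyTorch port is used, and `train_loader`/`valid_loader` for WDSQ and Mintaka are assumed to be built elsewhere.

```python
# Sketch: compare Amos vs. AdamW when fine-tuning t5-large as a seq2seq model.
# The `Amos` import below is an assumption: the official Amos implementation is
# JAX/optax, so a torch.optim-style port would be needed for this exact code.
import time

import torch
from torch.optim import AdamW
from transformers import AutoModelForSeq2SeqLM, get_linear_schedule_with_warmup

MODEL_NAME = "t5-large"
LR = 1e-4
NUM_TRAINING_STEPS = 10_000


def build_optimizer(name: str, model: torch.nn.Module):
    """Return the optimizer under comparison; the 'amos' branch is hypothetical."""
    if name == "adamw":
        return AdamW(model.parameters(), lr=LR, weight_decay=0.01)
    if name == "amos":
        # Hypothetical import: substitute whichever PyTorch Amos port is used.
        from amos_pytorch import Amos  # assumption, not a verified package name
        return Amos(model.parameters(), lr=LR)
    raise ValueError(f"unknown optimizer: {name}")


@torch.no_grad()
def evaluate(model, valid_loader, device):
    """Mean validation loss over the loader."""
    model.eval()
    losses = [model(**{k: v.to(device) for k, v in b.items()}).loss.item()
              for b in valid_loader]
    model.train()
    return sum(losses) / len(losses)


def train_and_time(optimizer_name: str, train_loader, valid_loader, device="cuda"):
    """One fine-tuning run; returns (best validation loss, wall-clock seconds)."""
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).to(device)
    optimizer = build_optimizer(optimizer_name, model)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=500, num_training_steps=NUM_TRAINING_STEPS
    )

    start, best_valid, step = time.time(), float("inf"), 0
    model.train()
    while step < NUM_TRAINING_STEPS:
        for batch in train_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            step += 1
            if step % 1000 == 0:
                best_valid = min(best_valid, evaluate(model, valid_loader, device))
            if step >= NUM_TRAINING_STEPS:
                break
    return best_valid, time.time() - start
```

Running `train_and_time("adamw", ...)` and `train_and_time("amos", ...)` on the same WDSQ and Mintaka loaders would give the validation-loss and speed numbers this issue asks for.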