kozistr / pytorch_optimizer

optimizer & lr scheduler & loss function collections in PyTorch
https://pytorch-optimizers.readthedocs.io/en/latest/
Apache License 2.0

AdamG: Towards Stability of Parameter-free Optimization #264

Closed. Vectorrent closed this issue 3 months ago.

Vectorrent commented 3 months ago

https://arxiv.org/abs/2405.04376

I've been experimenting with parameter-free optimizers lately (like Prodigy), and came upon AdamG:

Hyperparameter tuning, particularly the selection of an appropriate learning rate in adaptive gradient training methods, remains a challenge. To tackle this challenge, in this paper, we propose a novel parameter-free optimizer, AdamG (Adam with the golden step size), designed to automatically adapt to diverse optimization problems without manual tuning. The core technique underlying AdamG is our golden step size derived for the AdaGrad-Norm algorithm, which is expected to help AdaGrad-Norm preserve the tuning-free convergence and approximate the optimal step size in expectation w.r.t. various optimization scenarios. To better evaluate tuning-free performance, we propose a novel evaluation criterion, reliability, to comprehensively assess the efficacy of parameter-free optimizers in addition to classical performance criteria. Empirical results demonstrate that compared with other parameter-free baselines, AdamG achieves superior performance, which is consistently on par with Adam using a manually tuned learning rate across various optimization tasks.

I was able to hack together a version of AdamG in TFJS, and it performs fairly well! But I am not at all sure if my version is mathematically sound.

Would love to see an implementation of AdamG in PyTorch! As far as I'm aware, this code does not exist anywhere else. I'm opening a feature request here for posterity, though I might get around to implementing it in a PR myself someday.

kozistr commented 3 months ago

@Vectorrent thanks for the suggestion!

I just implemented the AdamG optimizer based on the pseudo-code in the paper; see #265. If you have any suggestions or reviews, feel free to take a look and leave a comment :)

[image attachment]
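For reference, a minimal usage sketch (not the exact code in #265), assuming AdamG is exported from the package root like the other optimizers in this library:

```python
import torch
from torch import nn

from pytorch_optimizer import AdamG  # assumes a package-root export, like the other optimizers

# toy regression problem, just to show the call pattern
model = nn.Linear(10, 1)
optimizer = AdamG(model.parameters())  # parameter-free: the idea is to leave the defaults alone

x, y = torch.randn(64, 10), torch.randn(64, 1)
for _ in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```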

Vectorrent commented 3 months ago

That was quick! Thanks a lot, I'll be testing ASAP. I love this library 🙂

Vectorrent commented 3 months ago

Not trying to nitpick... but in the research, the authors set η_k = 1. That's the learning rate/step size, right? Do you think it would be better to set the default LR to 1.0 here as well, @kozistr?

From the "setup" section:

Unless otherwise specified, all Adam and Adam-type parameter-free optimizers are paired with a cosine learning rate scheduler. I.e., the default value of η_k in AdamG, D-Adapt Adam and Prodigy Adam is set to 1 with extra cosine annealing decay strategy...
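For context, that setup maps onto something like this (a sketch, assuming AdamG takes a standard lr argument and works with PyTorch's built-in schedulers):

```python
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR

from pytorch_optimizer import AdamG

num_steps = 1_000  # hypothetical training length
model = nn.Linear(10, 1)

optimizer = AdamG(model.parameters(), lr=1.0)              # eta_k = 1, as in the paper's setup
scheduler = CosineAnnealingLR(optimizer, T_max=num_steps)  # the "extra cosine annealing decay strategy"

# inside the training loop: optimizer.step() followed by scheduler.step()
```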

kozistr commented 3 months ago

> Not trying to nitpick... but in the research, the authors set η_k = 1. That's the learning rate/step size, right? Do you think it would be better to set the default LR to 1.0 here as well, @kozistr?
>
> From the "setup" section:
>
> Unless otherwise specified, all Adam and Adam-type parameter-free optimizers are paired with a cosine learning rate scheduler. I.e., the default value of η_k in AdamG, D-Adapt Adam and Prodigy Adam is set to 1 with extra cosine annealing decay strategy...

yeap, afaik, they used a default learning rate of 1.0.

umm... actually, I don't have much intuition about the learning rate of this optimizer yet. However, I guess the main reason they used 1.0 is for a fair comparison with previous works, and since AdamG is a parameter-free, scale-free optimizer, the assumption is that you don't need to tune its parameters (e.g. lr) empirically.

in short, an absolute value of 1.0 looks too high for training, but it could be a proper step size for this update rule. of course, it needs more observation, though.

maybe we could find some intuition in other optimizers' repos, like Prodigy and D-Adaptation.

Vectorrent commented 3 months ago

I don't have much intuition here, either. Given that the Prodigy and DAdapt methods also use an LR of 1.0, I'd dare say these would be more appropriate defaults for AdamG:

lr = 1.0
p = 0.2
q = 0.24
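Hypothetically, that would look like this (the p/q argument names are assumed from the paper's notation and may differ from what #265 actually exposes):

```python
from torch import nn
from pytorch_optimizer import AdamG

model = nn.Linear(10, 1)
# p and q are the golden-step-size constants from the paper; the keyword names are assumed here
optimizer = AdamG(model.parameters(), lr=1.0, p=0.2, q=0.24)
```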

Prodigy recommends NEVER changing the learning rate:

We recommend using lr=1. (default) for all networks. If you want to force the method to estimate a smaller or larger learning rate, it is better to change the value of d_coef (1.0 by default). Values of d_coef above 1, such as 2 or 10, will force a larger estimate of the learning rate; set it to 0.5 or even 0.1 if you want a smaller learning rate.

I suppose the golden step in AdamG acts like the d_coef in Prodigy; it is what scales the learning rate, and makes the optimizer adaptive.
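For comparison, here's the Prodigy knob being described (assuming the Prodigy implementation in this library mirrors the upstream d_coef argument):

```python
from torch import nn
from pytorch_optimizer import Prodigy

model = nn.Linear(10, 1)
# leave lr at 1.0 and nudge d_coef instead if you want a larger or smaller estimated step size
optimizer = Prodigy(model.parameters(), lr=1.0, d_coef=1.0)
```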

kozistr commented 3 months ago

> I don't have much intuition here, either. Given that the Prodigy and DAdapt methods also use an LR of 1.0, I'd dare say these would be more appropriate defaults for AdamG:
>
> lr = 1.0
> p = 0.2
> q = 0.24
>
> Prodigy recommends NEVER changing the learning rate:
>
> We recommend using lr=1. (default) for all networks. If you want to force the method to estimate a smaller or larger learning rate, it is better to change the value of d_coef (1.0 by default). Values of d_coef above 1, such as 2 or 10, will force a larger estimate of the learning rate; set it to 0.5 or even 0.1 if you want a smaller learning rate.
>
> I suppose the golden step in AdamG acts like the d_coef in Prodigy; it is what scales the learning rate, and makes the optimizer adaptive.

I agree with you