The current implementation of Adafactor is consistent with the paper's default hyper-parameter choices. In particular, in the get_lr function at https://github.com/jettify/pytorch-optimizer/blob/19c3e41952b94f2d60db06e559ee9a1433b25e53/torch_optimizer/adafactor.py#L85 we can see that when relative_step is True, the learning rate supplied by the user is ignored and a time-dependent learning rate is used instead:
```python
if param_group["relative_step"]:
    min_step = (
        1e-6 * param_state["step"]
        if param_group["warmup_init"]
        else 1e-2
    )
    rel_step_sz = min(min_step, 1.0 / math.sqrt(param_state["step"]))
```
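For concreteness, here is a minimal standalone sketch (assuming nothing beyond the snippet above; it is not library code) that evaluates this schedule at a few steps:

```python
import math

# Reproduces the relative-step schedule from the snippet above.
def relative_step_size(step: int, warmup_init: bool) -> float:
    min_step = 1e-6 * step if warmup_init else 1e-2
    return min(min_step, 1.0 / math.sqrt(step))

for t in (1, 100, 10_000, 1_000_000):
    print(t, relative_step_size(t, True), relative_step_size(t, False))
```

With warmup_init, the step size ramps up linearly with slope 1e-6 until the 1/sqrt(t) decay takes over; without it, the step size stays at 1e-2 until step 10^4 and then follows the 1/sqrt(t) decay.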
That means the learning rate is defined as min(1e-6*t, 1/sqrt(t)) when warmup_init is True, and as min(1e-2, 1/sqrt(t)) otherwise. These hard-coded values, 1e-6 and 1e-2, are not an optimal choice in general; the best values are data-dependent. I would suggest changing those lines to:
```python
if param_group["relative_step"]:
    min_step = (
        param_group["lr"] * param_state["step"]
        if param_group["warmup_init"]
        else param_group["lr"]
    )
    rel_step_sz = min(min_step, 1.0 / math.sqrt(param_state["step"]))
```
That would let users configure these hyper-parameters through the input learning rate.
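As a sanity check, here is a sketch of the proposed schedule (the helper function name is mine, for illustration only). Passing lr=1e-6 with warmup_init=True, or lr=1e-2 without it, reproduces the current hard-coded behavior exactly, so the change is backward compatible if the defaults are chosen accordingly:

```python
import math

# Proposed schedule: `lr` takes over the role of the hard-coded constants.
def proposed_step_size(step: int, lr: float, warmup_init: bool) -> float:
    min_step = lr * step if warmup_init else lr
    return min(min_step, 1.0 / math.sqrt(step))

# lr=1e-6 matches the current warmup_init=True schedule;
# lr=1e-4 ramps up 100x faster but follows the same 1/sqrt(t) decay later.
for t in (1, 100, 10_000):
    print(t, proposed_step_size(t, 1e-6, True), proposed_step_size(t, 1e-4, True))
```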