The current implementation of Adafactor is consistent with the paper's default hyper-parameter choices. In particular, in the get_lr function at https://github.com/jettify/pytorch-optimizer/blob/19c3e41952b94f2d60db06e559ee9a1433b25e53/torch_optimizer/adafactor.py#L85 we can see that when relative_step is True, the learning rate supplied by the user is ignored and a time-dependent learning rate is used instead:
```python
if param_group["relative_step"]:
    min_step = (
        1e-6 * param_state["step"]
        if param_group["warmup_init"]
        else 1e-2
    )
    rel_step_sz = min(min_step, 1.0 / math.sqrt(param_state["step"]))
```
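For concreteness, here is a minimal standalone sketch (assuming nothing beyond the snippet above; it is not library code) that evaluates this schedule at a few steps:

```python
import math

# Reproduces the relative-step schedule from the snippet above.
def relative_step_size(step: int, warmup_init: bool) -> float:
    min_step = 1e-6 * step if warmup_init else 1e-2
    return min(min_step, 1.0 / math.sqrt(step))

for t in (1, 100, 10_000, 1_000_000):
    print(t, relative_step_size(t, True), relative_step_size(t, False))
```

With warmup_init, the step size ramps up linearly with slope 1e-6 until the 1/sqrt(t) decay takes over; without it, the step size stays at 1e-2 until step 10^4 and then follows the 1/sqrt(t) decay.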
That means the learning rate is defined as min(1e-6*t, 1/sqrt(t)) when warmup_init is True, and as min(1e-2, 1/sqrt(t)) otherwise. These hard-coded values, 1e-6 and 1e-2, are not an optimal choice in general; the best values are data-dependent. I would suggest changing those lines to:
```python
if param_group["relative_step"]:
    min_step = (
        param_group["lr"] * param_state["step"]
        if param_group["warmup_init"]
        else param_group["lr"]
    )
    rel_step_sz = min(min_step, 1.0 / math.sqrt(param_state["step"]))
```
That would let users configure these hyper-parameters through the input learning rate.
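As a sanity check, here is a sketch of the proposed schedule (the helper function name is mine, for illustration only). Passing lr=1e-6 with warmup_init=True, or lr=1e-2 without it, reproduces the current hard-coded behavior exactly, so the change is backward compatible if the defaults are chosen accordingly:

```python
import math

# Proposed schedule: `lr` takes over the role of the hard-coded constants.
def proposed_step_size(step: int, lr: float, warmup_init: bool) -> float:
    min_step = lr * step if warmup_init else lr
    return min(min_step, 1.0 / math.sqrt(step))

# lr=1e-6 matches the current warmup_init=True schedule;
# lr=1e-4 ramps up 100x faster but follows the same 1/sqrt(t) decay later.
for t in (1, 100, 10_000):
    print(t, proposed_step_size(t, 1e-6, True), proposed_step_size(t, 1e-4, True))
```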