lessw2020 / Ranger-Deep-Learning-Optimizer

Ranger - a synergistic optimizer using RAdam (Rectified Adam), Gradient Centralization and LookAhead in one codebase
Apache License 2.0

Too large step_size at initialization stage #15

Open e-sha opened 5 years ago

e-sha commented 5 years ago

I found that step_size is too high during the first 5 steps. The problem is in this code:

if N_sma >= self.N_sma_threshhold:
    step_size = math.sqrt((1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4) * (N_sma - 2) / N_sma * N_sma_max / (N_sma_max - 2)) / (1 - beta1 ** state['step'])
else:
    step_size = 1.0 / (1 - beta1 ** state['step'])

If betas are set to (0.9, 0.999), the internal variables change as follows:

| state['step'] | step_size  |
| ------------: | ---------: |
|             1 | 10         |
|             2 | 5.26315789 |
|             3 | 3.6900369  |
|             4 | 2.90782204 |
|             5 | 2.44194281 |
|             6 | 0.00426327 |
|             7 | 0.00524248 |
|             8 | 0.00607304 |
|             9 | 0.00681674 |
|            10 | 0.00750596 |
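
For reference, these numbers can be reproduced with a small standalone script that just re-evaluates the formula above (assuming the default N_sma_threshhold of 5):

```python
import math

beta1, beta2 = 0.9, 0.999
N_sma_threshhold = 5                     # assumed Ranger/RAdam default
N_sma_max = 2.0 / (1.0 - beta2) - 1.0    # = 1999 for beta2 = 0.999

for step in range(1, 11):
    beta2_t = beta2 ** step
    N_sma = N_sma_max - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if N_sma >= N_sma_threshhold:
        # rectified (RAdam) step size
        step_size = math.sqrt(
            (1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4)
            * (N_sma - 2) / N_sma
            * N_sma_max / (N_sma_max - 2)
        ) / (1 - beta1 ** step)
    else:
        # warm-up fallback: bias-corrected momentum step only
        step_size = 1.0 / (1 - beta1 ** step)
    print(f"{step:2d} | {step_size:.8f}")
```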

Note that step_size does not depend on the gradient value, and it scales the learning rate. Thus RAdam aggressively moves the weights away from their initial values during the first steps, even if they have a good initialization.

Would it be better to set step_size to 0 when N_sma < self.N_sma_threshhold?
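
Concretely, the change I have in mind would only touch the else branch, roughly like this (just a sketch, not a tested patch):

```python
if N_sma >= self.N_sma_threshhold:
    step_size = math.sqrt(
        (1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4)
        * (N_sma - 2) / N_sma
        * N_sma_max / (N_sma_max - 2)
    ) / (1 - beta1 ** state['step'])
else:
    # proposed: take no step while the variance estimate is still
    # unreliable, instead of a large bias-corrected momentum step
    step_size = 0.0
```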

lessw2020 commented 5 years ago

Hi @e-sha - thanks for pointing this out! Offhand, yes, 0 looks like it would be a better choice, but I will need to test and see. Can you test it if you have time today? I will try to test it later this evening and can then update the code if that turns out to be the best option (which it appears to be). I also have some ideas from a couple of other optimizers that might work better than 0 for the first five steps, but I won't have time to test those until later (see RangerQH, for example). Thanks!