e-sha opened this issue 5 years ago
Hi @e-sha - thanks for pointing this out! Offhand, yes, it looks like 0 would be a better result, but I'll need to test to be sure. Can you test it if you have time today? I'll try to test it later this evening and will update if it turns out to be the best option (which it appears to be). I also have some work from a couple of other optimizers that might be better than 0 for the first five steps (see RangerQH, for example), but I won't have time to test that until later. Thanks!
I found that step_size is too high during the first 5 steps. The problem is in this code:
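For reference, the branch in question computes the following (a paraphrased sketch of the RAdam reference implementation, not the verbatim source; the buffering and parameter-update bookkeeping around it are omitted, and example values stand in for the optimizer state):

```python
import math

# Example values standing in for the optimizer state:
beta1, beta2 = 0.9, 0.999        # group['betas']
step = 1                         # state['step']
N_sma_threshhold = 5             # self.N_sma_threshhold

beta2_t = beta2 ** step
N_sma_max = 2.0 / (1 - beta2) - 1
N_sma = N_sma_max - 2 * step * beta2_t / (1 - beta2_t)

if N_sma > N_sma_threshhold:
    # Rectified update: the variance rectification term is defined.
    step_size = math.sqrt(
        (1 - beta2_t)
        * (N_sma - 4) / (N_sma_max - 4)
        * (N_sma - 2) / N_sma
        * N_sma_max / (N_sma_max - 2)
    ) / (1 - beta1 ** step)
else:
    # Un-rectified fallback: a plain bias-corrected momentum step,
    # later multiplied by group['lr'] in the parameter update.
    step_size = 1.0 / (1 - beta1 ** step)
```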
If betas are set to (0.9, 0.999), the internal variables evolve as follows.
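A minimal standalone sketch to recompute them (my own code, reusing the formulas from the branch above; lr is factored out, so step_size is the multiplier applied to it):

```python
beta1, beta2 = 0.9, 0.999
N_sma_max = 2.0 / (1 - beta2) - 1      # = 1999
for t in range(1, 6):
    beta2_t = beta2 ** t
    N_sma = N_sma_max - 2 * t * beta2_t / (1 - beta2_t)
    step_size = 1.0 / (1 - beta1 ** t)  # un-rectified branch (N_sma < 5)
    print(f"step {t}: N_sma = {N_sma:.4f}, step_size = {step_size:.3f}")

# step 1: N_sma ≈ 1.0000, step_size ≈ 10.000
# step 2: N_sma ≈ 1.9995, step_size ≈ 5.263
# step 3: N_sma ≈ 2.9987, step_size ≈ 3.690
# step 4: N_sma ≈ 3.9975, step_size ≈ 2.908
# step 5: N_sma ≈ 4.9960, step_size ≈ 2.442
# From step 6 on, N_sma ≈ 5.994 > 5, so the rectified branch takes over.
```

So at step 1 the parameters move with an effective learning rate of 10 * lr, decaying to roughly 2.4 * lr by step 5.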
Note that step_size doesn't depend on the gradient value, and it scales the learning rate. Thus RAdam aggressively moves weights away from their initial values, even when they have a good initialization.
Would it be better to set step_size to 0 when N_sma < self.N_sma_threshhold?
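Concretely, the proposal would amount to something like this (a hypothetical, untested patch against the branch sketched above):

```python
if N_sma > N_sma_threshhold:
    step_size = math.sqrt(
        (1 - beta2_t)
        * (N_sma - 4) / (N_sma_max - 4)
        * (N_sma - 2) / N_sma
        * N_sma_max / (N_sma_max - 2)
    ) / (1 - beta1 ** step)
else:
    # Proposed: skip the parameter update while the rectification term
    # is undefined. exp_avg and exp_avg_sq still accumulate, so this
    # acts as a short built-in warmup instead of an inflated momentum step.
    step_size = 0.0   # was: 1.0 / (1 - beta1 ** step)
```

With betas = (0.9, 0.999) this zeroes the update for exactly the first 5 steps, after which the rectified branch applies unchanged.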