XuezheMax closed this issue 3 years ago.
To be honest, all comparisons are unfair in one way or another, but in general I agree with you. I am going to add a note that the visualizations are not a good way to select an optimizer.
I am happy to merge any PRs with improvements to the visualizations. I also have a few things in mind, like searching over more hyper-parameters, but I have not managed to do it yet.
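As a rough sketch of the "search over more hyper-parameters" idea, the search space could cover more than the learning rate. The use of hyperopt, the Adam example, the search ranges, and all names below are illustrative assumptions rather than the repo's actual code:

```python
import torch
from hyperopt import fmin, hp, tpe


def rosenbrock(xy):
    # 2-D test function with a global minimum at (1, 1).
    return (1 - xy[0]) ** 2 + 100 * (xy[1] - xy[0] ** 2) ** 2


def objective(params):
    # One short optimization run on the test function; the returned value is
    # what the search tries to minimize.
    xy = torch.tensor([-2.0, 2.0], requires_grad=True)
    optimizer = torch.optim.Adam(
        [xy], lr=params["lr"], betas=(params["beta1"], 0.999)
    )
    for _ in range(100):
        optimizer.zero_grad()
        loss = rosenbrock(xy)
        loss.backward()
        optimizer.step()
    return rosenbrock(xy).item()


# Search over the learning rate and Adam's first beta instead of lr alone.
space = {
    "lr": hp.loguniform("lr", -8, 0),
    "beta1": hp.uniform("beta1", 0.8, 0.99),
}

best = fmin(objective, space, algo=tpe.suggest, max_evals=100)
print(best)
```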
Thanks a lot for your response.
Here is the PR that adds the warning: https://github.com/jettify/pytorch-optimizer/pull/222. Please create a PR if you want to improve the messaging there.
The message is great. Thanks!
Hi,
Thanks a lot for this great repo. For the comparison in the Visualizations example, I found that you run only 100 updates for each config. I am concerned that 100 is too small and would favor optimizers that converge quickly in the first few updates.
For optimizers whose convergence is relatively slow at the beginning, the search would then select a large learning rate, which could lead to unstable convergence for those optimizers.
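For illustration, here is a minimal sketch of what a longer run on a 2-D test function could look like, so that slow-starting optimizers are not penalized by a short budget. The Rosenbrock function, the starting point, and the 500-step budget are assumptions for the example, not the repo's actual settings:

```python
import torch


def rosenbrock(xy):
    # 2-D test function with a global minimum at (1, 1).
    x, y = xy
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2


def run(optimizer_cls, lr, num_steps=500):
    # Start from a fixed point so different optimizers are comparable.
    xy = torch.tensor([-2.0, 2.0], requires_grad=True)
    optimizer = optimizer_cls([xy], lr=lr)
    for _ in range(num_steps):
        optimizer.zero_grad()
        loss = rosenbrock(xy)
        loss.backward()
        optimizer.step()
    return xy.detach()


# Example: run Adam over the longer horizon and print the final point.
print(run(torch.optim.Adam, lr=1e-2))
```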
Moreover, for the hyper-parameter search, the objective is the distance between the last-step point and the minimum. I think the function value at the last-step point may be a better objective.
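A minimal sketch of the two candidate objectives, reusing the `rosenbrock` and `run` helpers from the previous sketch (the minimum at (1, 1) is a property of that test function; everything else is illustrative):

```python
import torch

# Known global minimum of the Rosenbrock test function used above.
minimum = torch.tensor([1.0, 1.0])


def distance_objective(last_point):
    # Current objective: Euclidean distance from the last iterate to the minimum.
    return torch.dist(last_point, minimum).item()


def value_objective(last_point):
    # Proposed objective: function value at the last iterate.
    return rosenbrock(last_point).item()


# Usage inside a hyper-parameter search, for example:
last_point = run(torch.optim.Adam, lr=1e-2)
print(distance_objective(last_point), value_objective(last_point))
```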
Lastly, some optimizers implicitly implement learning-rate decay (such as AdaBound and RAdam), but others do not, and no explicit learning-rate schedule is used in your comparison.
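As a sketch of one way to put such optimizers on a more equal footing, an explicit scheduler could be attached in the update loop. The cosine schedule and the step budget are assumptions, and `rosenbrock` refers to the earlier sketch:

```python
import torch


def run_with_schedule(optimizer_cls, lr, num_steps=500):
    xy = torch.tensor([-2.0, 2.0], requires_grad=True)
    optimizer = optimizer_cls([xy], lr=lr)
    # Explicit decay for optimizers that do not decay their step size internally.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
    for _ in range(num_steps):
        optimizer.zero_grad()
        loss = rosenbrock(xy)  # test function from the earlier sketch
        loss.backward()
        optimizer.step()
        scheduler.step()
    return xy.detach()
```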