Closed: seanpmorgan closed this issue 5 years ago
That would be awesome indeed. For the record, here is a Keras implementation (not official): https://github.com/CyberZHG/keras-radam.
This is going to be a great addon! Anyone looking for another detailed comparison can read through https://medium.com/@lessw/new-state-of-the-art-ai-optimizer-rectified-adam-radam-5d854730807b
I vote for this as well. I've been looking for this!
Great addition. Can I try implementing this? @seanpmorgan
@SSaishruthi Sure! I know @sayoojbk has also expressed interest in helping with this, so if you could open a WIP PR as soon as you get started, that'd be great; we could have a few eyes on it and push it through.
Looking forward to that!
Seems that there is an unofficial implementation for TF/Keras. https://github.com/CyberZHG/keras-radam
I have found one for TF: https://github.com/taki0112/RAdam-Tensorflow
Thanks for the links. I have noted all of them and plan to kick off the implementation after this weekend; I just finished another priority change. Will keep you posted.
@SSaishruthi it looks like RAdam already has an improvement called 'RAdam + Lookahead'. One possible implementation of Lookahead: https://github.com/bojone/keras_lookahead (I've been testing this one myself).
It's called 'Ranger' (the combination of RAdam + Lookahead). A short article about it: https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d
Paper: https://arxiv.org/abs/1907.08610v1 Lookahead PyTorch implementation: https://github.com/lonePatient/lookahead_pytorch/blob/master/optimizer.py
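For reference, a minimal sketch of how that combination can be expressed once both pieces are available in Addons (RAdam as the inner optimizer, wrapped by a `tfa.optimizers.Lookahead` wrapper); the values are illustrative defaults, not a recommendation:

```python
import tensorflow_addons as tfa

# Sketch of the "Ranger" combination: RAdam as the inner optimizer,
# wrapped by the Lookahead mechanism from the linked paper.
# Assumes the tfa.optimizers.Lookahead wrapper; values are illustrative.
radam = tfa.optimizers.RectifiedAdam(learning_rate=1e-3)
ranger = tfa.optimizers.Lookahead(radam, sync_period=6, slow_step_size=0.5)
```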
@luminoso However, it seems that repo does not support TensorFlow.
Hi @SSaishruthi, I know you're working on several different things, including the core migration of F1. Would you be okay with @AakashKumarNain taking a look at this one, as he has expressed interest? Would love for you to help review any implementation.
@seanpmorgan Sure. Will collaborate along so that we keep things going.
@seanpmorgan @SSaishruthi The Keras implementation pointed out in the comments LGTM. It is also written against the OptimizerV2 API. Take a look: https://github.com/CyberZHG/keras-radam/blob/master/keras_radam/optimizer_v2.py
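For anyone unfamiliar with what "written against the OptimizerV2 API" implies, here is a minimal skeleton of the hooks such an optimizer fills in (slot creation, dense/sparse apply, config) under the TF 2.x OptimizerV2 base class of that era. The update rule below is plain gradient descent just to keep the sketch runnable; the linked keras-radam code implements the actual rectified update inside these same methods.

```python
import tensorflow as tf

# Minimal OptimizerV2-style skeleton (illustrative, not the RAdam update).
class MinimalOptimizer(tf.keras.optimizers.Optimizer):
    def __init__(self, learning_rate=0.01, name="MinimalOptimizer", **kwargs):
        super().__init__(name, **kwargs)
        self._set_hyper("learning_rate", learning_rate)

    def _create_slots(self, var_list):
        # A RAdam port creates "m" and "v" slots per variable, like Adam.
        for var in var_list:
            self.add_slot(var, "m")
            self.add_slot(var, "v")

    def _resource_apply_dense(self, grad, var, apply_state=None):
        lr = self._get_hyper("learning_rate", var.dtype.base_dtype)
        return var.assign_sub(lr * grad)

    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
        lr = self._get_hyper("learning_rate", var.dtype.base_dtype)
        return self._resource_scatter_add(var, indices, -lr * grad)

    def get_config(self):
        config = super().get_config()
        config.update(
            {"learning_rate": self._serialize_hyperparameter("learning_rate")})
        return config
```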
Ping @CyberZHG. Would it be okay to use your implementation as part of Addons? The license you have on it looks like it'd be okay -- but wanted to get your permission/see if you'd like to contribute it yourself?
Yeah, it would be fair if @CyberZHG just adds it here. Most of the work is already done there.
I'll try to migrate the code and make a PR in the next few days.
Is an official implementation available now?
@Alessiobrini https://www.tensorflow.org/addons/api_docs/python/tfa/optimizers/RectifiedAdam
Thank you! However, if I set the warmup proportion and total steps, the optimizer doesn't seem to do the proper learning rate warmup. I don't know if I'm doing something wrong, but the docs say that specifying those parameters should be sufficient to get the warmup schedule.
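For reference, a minimal sketch of how the warmup-related arguments are passed according to the RectifiedAdam docs (the values are illustrative, and this is not a claim that it resolves the behaviour described above):

```python
import tensorflow_addons as tfa

# Per the docs, setting total_steps, warmup_proportion and min_lr should be
# enough to enable warmup: the learning rate ramps up over the first
# warmup_proportion * total_steps steps and then decays towards min_lr.
opt = tfa.optimizers.RectifiedAdam(
    learning_rate=1e-3,
    total_steps=10000,
    warmup_proportion=0.1,
    min_lr=1e-5,
)
```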
One question. I've been looking at how weight decay is implemented here, in the original RAdam paper, and in the paper that originally introduced weight decay for Adam (also known as AdamW, which the RAdam paper cites).
The implementations seem to be pretty much the same, but there's a subtle difference: in AdamW the weight decay can also be scheduled for warm restarts, and in fact they propose a normalized value for those cases. Looking at the TF addons AdamW implementation you can see that both learning rate and weight decay argument support a callable: that is, a LearningRateSchedule. However, the RectifiedAdam implementation only does so for the learning rate.
Looking at the code it looks like this should be relatively trivial to change: instead of using the weight decay as a constant hyperparameter, it would need to be handled in a similar way to how decayed_lr works here.
Was there any reason to not also support weight decay schedulers in this case? I imagine you can get the same result by passing weight decay as a tensor with the scheduling already applied, but by doing so the scheduler state might be lost when serializing and deserializing back.
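To make the asymmetry concrete, here is a sketch based on the callable-weight-decay pattern shown in the tfa.optimizers.AdamW docs; the schedules and values are illustrative:

```python
import tensorflow as tf
import tensorflow_addons as tfa

step = tf.Variable(0, trainable=False)
lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    [10000, 15000], [1e-3, 1e-4, 1e-5])
wd_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    [10000, 15000], [1e-4, 1e-5, 1e-6])

# AdamW accepts a callable for weight_decay, so decay can follow a schedule
# in lockstep with the learning rate.
adamw = tfa.optimizers.AdamW(
    learning_rate=lambda: lr_schedule(step),
    weight_decay=lambda: wd_schedule(step))

# RectifiedAdam accepts a schedule for learning_rate, but weight_decay is a
# plain float hyperparameter, which is the limitation discussed above.
radam = tfa.optimizers.RectifiedAdam(
    learning_rate=lr_schedule, weight_decay=1e-4)
```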
@leandro-gracia-gil Thanks for your good question, Leandro. Would you mind filing an issue for it?
@facaiy Sure. Here it is: https://github.com/tensorflow/addons/issues/1908
Describe the feature and the current behavior/state. A new paper describes RAdam, which looks like a drop-in replacement for the Adam optimizer with better results.
https://arxiv.org/abs/1908.03265v1 https://github.com/LiyuanLucasLiu/RAdam
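Since the paper pitches RAdam as a drop-in replacement for Adam, usage with the Addons implementation that came out of this issue would look roughly like the sketch below (model and hyperparameters are illustrative):

```python
import tensorflow as tf
import tensorflow_addons as tfa

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Drop-in replacement: swap tf.keras.optimizers.Adam() for RectifiedAdam().
model.compile(
    optimizer=tfa.optimizers.RectifiedAdam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```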