Closed · milancurcic closed this 1 year ago
@milancurcic I first learned about AdamW from the Keras optimizer documentation and implemented it based solely on the paper; I did not refer to the PyTorch docs or the Keras source and missed this.
Furthermore, yes, I can see in the PyTorch docs that they multiply by the learning rate as well. Since PyTorch is widely used, I also think it should be right. I will update our implementation and create a new PR. Thank you for bringing this to my attention!
Resolved in #150
Currently, our implementation reads:
https://github.com/modern-fortran/neural-fortran/blob/6adc1c25efe36f142b9611570ad618f8d2c9429b/src/nf/nf_optimizers.f90#L188-L190
However, I'm looking at the paper and PyTorch docs again.
In the paper, in Algorithm 2, line 12, the weight decay term $\lambda \theta_{t-1}$ (in our code, `self % weight_decay_decoupled * param`) is multiplied by the schedule multiplier $\eta_t$, but not by the learning rate $\alpha$. In the PyTorch docs for AdamW, however, $\lambda \theta_{t-1}$ is multiplied by the learning rate; I can also see that in the source code. I'm not familiar with PyTorch's design, but it's possible that the schedule multiplier is already folded into the learning rate.
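To make the discrepancy concrete, here is a minimal Python sketch (not the Fortran code, and the function names are hypothetical) of the two readings of the decoupled weight decay term:

```python
def decay_per_paper(param, schedule, weight_decay):
    # Loshchilov & Hutter, Algorithm 2, line 12: the decay term
    # lambda * theta is scaled by the schedule multiplier eta_t only.
    return param - schedule * weight_decay * param

def decay_per_pytorch(param, lr, weight_decay):
    # torch.optim.AdamW reading: the decay term is scaled by the
    # learning rate (any schedule is folded into lr by the scheduler).
    return param - lr * weight_decay * param

p, lr, eta, wd = 1.0, 0.01, 1.0, 0.1
print(decay_per_paper(p, eta, wd))      # large decay when eta_t = 1
print(decay_per_pytorch(p, lr, wd))     # decay shrunk by the small lr
```

With a typical small learning rate the two differ by orders of magnitude, so this is not a cosmetic difference.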
I looked at the Keras source and I don't even see where, or if, the weight decay is used (??).
@Spnetic-5 do you also see the same discrepancy between the paper and the PyTorch docs like I do?
If yes, I suggest that we multiply it by the learning rate in our code as well. I trust PyTorch to implement it correctly more than I trust our interpretation of the paper (and papers have typos, of course).
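For reference, a full AdamW step with the PyTorch-style decay would look roughly like this. This is a hedged Python sketch of the update rule under standard hyperparameter defaults, not the neural-fortran implementation:

```python
import math

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step on a scalar parameter, with the weight decay
    term scaled by the learning rate (the PyTorch reading)."""
    m = beta1 * m + (1 - beta1) * grad            # first moment
    v = beta2 * v + (1 - beta2) * grad * grad     # second moment
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: lr * weight_decay * param, added to the
    # Adam update rather than to the gradient.
    param = param - lr * (m_hat / (math.sqrt(v_hat) + eps)
                          + weight_decay * param)
    return param, m, v
```

With a zero gradient the update reduces to `param * (1 - lr * weight_decay)`, which makes the role of the learning rate in the decay term easy to check.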