modern-fortran / neural-fortran

A parallel framework for deep learning
MIT License

Question about the decoupled weight decay in Adam #152

Closed milancurcic closed 1 year ago

milancurcic commented 1 year ago

Currently, we have:

https://github.com/modern-fortran/neural-fortran/blob/6adc1c25efe36f142b9611570ad618f8d2c9429b/src/nf/nf_optimizers.f90#L188-L190

However, I'm looking at the paper and PyTorch docs again.

In the paper, in Algorithm 2, line 12, we have $\lambda \theta_{t-1}$ (in our code, `self % weight_decay_decoupled * param`) multiplied by the schedule multiplier $\eta_t$, but not by the learning rate $\alpha$.
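For reference, here is how I read line 12 of Algorithm 2 in the paper (paraphrasing its notation):

$$\theta_t \leftarrow \theta_{t-1} - \eta_t \left( \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)$$

Note that the learning rate $\alpha$ scales only the Adam term, while the schedule multiplier $\eta_t$ scales both.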

In the PyTorch docs for AdamW, $\lambda \theta_{t-1}$ is multiplied by the learning rate; I can also see that in the source code. I'm not familiar with PyTorch's design, but it's possible that the schedule multiplier is already folded into the learning rate.
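As I read the PyTorch docs, the decoupled decay is applied as its own step scaled by the learning rate $\gamma$, with no separate schedule multiplier (again paraphrasing their notation):

$$\theta_t \leftarrow \theta_{t-1} - \gamma \lambda \theta_{t-1}$$

followed by the usual Adam step $\theta_t \leftarrow \theta_t - \gamma \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$.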

I looked at the Keras source and I can't even tell where, or whether, the weight decay is applied (??).

@Spnetic-5 do you also see this discrepancy between the paper and the PyTorch docs?

If so, I suggest we multiply it by the learning rate in our code as well. I trust that PyTorch implements it correctly more than I trust our reading of the paper (and papers have typos, of course).
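Concretely, and only as a minimal sketch (this is not the actual `nf_optimizers.f90` code; all variable names besides `weight_decay_decoupled` are illustrative), a single parameter update with the decay term scaled by the learning rate would look like:

```fortran
! Hypothetical, self-contained illustration; not the actual nf_optimizers.f90 code.
! One AdamW update in which the decoupled weight decay term is multiplied by
! the learning rate, following PyTorch's formulation.
program adamw_step_sketch
  implicit none
  real :: param, grad, m, v, m_hat, v_hat
  real, parameter :: learning_rate = 1e-3, beta1 = 0.9, beta2 = 0.999
  real, parameter :: epsilon = 1e-8, weight_decay_decoupled = 1e-2
  integer :: t

  param = 1.0
  grad = 0.5
  m = 0.0
  v = 0.0
  t = 1

  ! Standard Adam moment updates with bias correction
  m = beta1 * m + (1 - beta1) * grad
  v = beta2 * v + (1 - beta2) * grad**2
  m_hat = m / (1 - beta1**t)
  v_hat = v / (1 - beta2**t)

  ! Decoupled weight decay scaled by the learning rate (PyTorch-style)
  param = param - learning_rate &
    * (m_hat / (sqrt(v_hat) + epsilon) + weight_decay_decoupled * param)

  print *, 'updated param:', param
end program adamw_step_sketch
```

In our code the change would amount to moving `self % weight_decay_decoupled * param` inside the term that is scaled by the learning rate (assuming the learning rate is stored on the optimizer type as well).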

Spnetic-5 commented 1 year ago

@milancurcic I first learned about AdamW from the Keras optimizer documentation and implemented it based solely on the paper; I did not refer to the PyTorch docs or the Keras source, so I missed this.

And yes, I can see in the PyTorch docs that they multiply by the learning rate as well. Since PyTorch is so widely used, I also think its approach is the right one. I will update our implementation and open a new PR. Thank you for bringing this to my attention!

Spnetic-5 commented 1 year ago

Resolved in #150