Open thomasw21 opened 5 days ago
Hello!

Interesting work! The curves look promising! I wanted to ask about the AdamW implementation:
https://github.com/kyleliang919/C-Optim/blob/c360da388bf35e2066e7d390d3359631ade66318/c_adamw.py#L124-L126 seems to apply weight decay to the already-updated weights, whereas torch applies it prior to the update: https://github.com/pytorch/pytorch/blob/5212ec38794601a4f8bbb3677af998d56b07f1b5/torch/optim/adamw.py#L405

Roughly, it would mean changing line 13 in Algorithm 2 to
w_t ← w_t − ϵ_t γ w_{t-1}
Was this a deliberate choice? Am I reading something wrong?

Hi there, sorry for the confusion. This was inherited from some legacy code. It shouldn't have any impact on optimizer quality, because the end of the current step is the start of the next step. The only minor difference is in the learning rate, if there is a learning rate scheduler.
I will update the code to align it with standard implementations.
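For reference, here is a minimal sketch of the two orderings being discussed, written as a single decoupled-weight-decay Adam step with plain tensor ops. This is not the repository's actual code; the function and variable names are illustrative only.

```python
import torch

def adamw_step_decay_before(p, grad, exp_avg, exp_avg_sq, step, lr,
                            beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=1e-2):
    """torch.optim.AdamW ordering: decay the pre-update weights, then apply the Adam update."""
    p.mul_(1 - lr * weight_decay)                                 # decay applied to w_{t-1}
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)               # first-moment estimate
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second-moment estimate
    denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt().add_(eps)
    p.addcdiv_(exp_avg / (1 - beta1 ** step), denom, value=-lr)   # bias-corrected Adam update
    return p

def adamw_step_decay_after(p, grad, exp_avg, exp_avg_sq, step, lr,
                           beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=1e-2):
    """Ordering the issue points at: apply the Adam update first, then decay the updated weights."""
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt().add_(eps)
    p.addcdiv_(exp_avg / (1 - beta1 ** step), denom, value=-lr)
    p.mul_(1 - lr * weight_decay)                                 # decay applied to the new w_t
    return p

# Toy usage: same inputs, the two variants differ only in where the decay lands.
p1, p2 = torch.ones(3), torch.ones(3)
g = torch.full((3,), 0.1)
m1, v1, m2, v2 = (torch.zeros(3) for _ in range(4))
adamw_step_decay_before(p1, g, m1, v1, step=1, lr=1e-3)
adamw_step_decay_after(p2, g, m2, v2, step=1, lr=1e-3)
```

Under a constant learning rate, the `p.mul_(1 - lr * weight_decay)` applied after the update at step t is exactly the multiplication the before-update ordering would perform at the start of step t+1, so the two orderings only diverge when a scheduler changes the learning rate between steps (and at the boundary steps), which is the point made in the reply above.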