kyleliang919 / C-Optim

When it comes to optimizers, it's always better to be safe than sorry
MIT License

Question about Adamw #3

Open thomasw21 opened 5 days ago

thomasw21 commented 5 days ago

Hello!

Interesting work! The curves look promising! I wanted to ask about the implementation of AdamW:

https://github.com/kyleliang919/C-Optim/blob/c360da388bf35e2066e7d390d3359631ade66318/c_adamw.py#L124-L126 seems to apply weight decay to the updated weights, whereas torch applies it prior to the update: https://github.com/pytorch/pytorch/blob/5212ec38794601a4f8bbb3677af998d56b07f1b5/torch/optim/adamw.py#L405

Roughly, it would mean changing line 13 in Algorithm 2 to w_t ← w_t − ϵ_t γ w_{t-1}.
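
For concreteness, a minimal sketch of the two orderings as I read them (the helper names and the precomputed `update` argument are illustrative, not taken from either codebase):

```python
import torch

def adamw_decay_before(p: torch.Tensor, update: torch.Tensor, lr: float, wd: float):
    """torch.optim.AdamW ordering: shrink w_{t-1} first, then apply the Adam update."""
    p.mul_(1 - lr * wd)        # w_{t-1} <- w_{t-1} * (1 - lr_t * wd)
    p.add_(update, alpha=-lr)  # w_t     <- w_{t-1} - lr_t * update_t

def adamw_decay_after(p: torch.Tensor, update: torch.Tensor, lr: float, wd: float):
    """Ordering in c_adamw.py as I read it: apply the Adam update, then decay the updated weights."""
    p.add_(update, alpha=-lr)  # w_t <- w_{t-1} - lr_t * update_t
    p.mul_(1 - lr * wd)        # w_t <- w_t * (1 - lr_t * wd)
```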

Was this a deliberate choice? Am I reading something wrong?

kyleliang919 commented 4 days ago

Hi there, sorry for the confusion. This was inherited from some legacy code. It shouldn't have any impact on optimizer quality, since the end of the current step is effectively the start of the next step. The only minor difference is which lr the decay uses, if there is a learning rate scheduler.

I will update the code to align it with standardized implementations.
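
To make the lr point concrete, a toy scalar sketch (the numbers and the constant `update` stand-in are made up; in practice the update would be the full Adam direction):

```python
# Toy scalar comparison with made-up numbers.
wd = 0.1
lrs = [1e-1, 5e-2]   # hypothetical scheduler: lr halves between the two steps
update = 1.0         # stand-in for the Adam direction at each step

# c_adamw.py ordering: apply the update, then decay the updated weight with the same lr_t.
w_after = 1.0
for lr in lrs:
    w_after = (w_after - lr * update) * (1 - lr * wd)

# torch.optim.AdamW ordering: decay w_{t-1} with lr_t, then apply the update.
w_before = 1.0
for lr in lrs:
    w_before = w_before * (1 - lr * wd) - lr * update

# The end values are nearly identical; the gap is the decay factor also hitting the
# just-applied update (an O(lr^2 * wd) term per step), which is the one-step /
# lr-scheduler nuance described above.
print(w_after, w_before)
```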