Thanks a lot for the PR. Yes, you are right: since we retrieve the peak LR from the optimizer in the training function, we should initialize the optimizer with the peak LR.
I agree with your update, but I'll set the `peak_lr` to 0.001 so that the loss and the plots afterwards don't change too much (to make it a bit less confusing for readers).
Regarding the second point, that's a great catch as well. I just double-checked and you are correct!
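For reference, a minimal sketch of what that change amounts to in the cell that creates the optimizer (the `Linear` model here is just a hypothetical stand-in for the GPT model used in the appendix):

```python
import torch

peak_lr = 0.001  # matches AdamW's default, so the loss curves and plots barely change

model = torch.nn.Linear(4, 4)  # hypothetical stand-in for the appendix's GPT model

# Initialize the optimizer with the peak LR so that retrieving it later via
# optimizer.param_groups[0]["lr"] inside train_model returns the intended value.
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)
```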
While toying around to compare with my code, I found two things in the complete training function from D.4 that I'm not sure about. Let me know what you think:
First, you showed two ways of passing the `peak_lr` value to the optimizer: passing it directly as an argument to the `train_model` function, or retrieving it from the optimizer's parameters inside `train_model` with `peak_lr = optimizer.param_groups[0]["lr"]`, which is the way implemented in the notebook and the book. But in the code, the `lr` argument for `optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.1)` is never passed as `lr=peak_lr`, so it falls back to AdamW's default of 1e-3 instead of the intended `peak_lr = 5e-4` when we retrieve it with `peak_lr = optimizer.param_groups[0]["lr"]`.
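A quick, self-contained check of this behavior (the `Linear` model is a hypothetical stand-in, used only to construct the optimizer):

```python
import torch

model = torch.nn.Linear(4, 4)  # hypothetical stand-in model

# As currently written in the notebook: no lr= argument is passed
optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.1)
print(optimizer.param_groups[0]["lr"])  # 0.001 (AdamW's default), not 5e-4

# Passing the intended peak LR explicitly makes the later retrieval consistent
peak_lr = 5e-4
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)
print(optimizer.param_groups[0]["lr"])  # 0.0005
```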
Second, there is a gap in the gradient clipping: there is no clipping for the first step after the warmup ends, when `global_step == warmup_steps`, because the warmup stops at `if global_step < warmup_steps:` and the clipping only starts at `if global_step > warmup_steps:`. I'm not sure it was intended to have no clipping on the step where the lr is at its maximum, because you also mentioned:
Example of an output with `warmup_steps = 18` that prints yes/no under the above conditions:
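A minimal sketch of that check (not the full training function), assuming `warmup_steps = 18` as above:

```python
warmup_steps = 18

for global_step in range(warmup_steps - 1, warmup_steps + 2):
    warmup = "yes" if global_step < warmup_steps else "no"    # warmup branch as written
    clipping = "yes" if global_step > warmup_steps else "no"  # clipping branch as written
    print(f"global_step={global_step}  warmup={warmup}  clipping={clipping}")

# global_step=17  warmup=yes  clipping=no
# global_step=18  warmup=no   clipping=no   <- neither branch runs (the gap)
# global_step=19  warmup=no   clipping=yes
# Changing the clipping check to `global_step >= warmup_steps` would close the gap.
```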