rasbt / LLMs-from-scratch

Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167

Potential little fixes `appendix-D4.ipynb` #427

Closed · casinca closed this 4 weeks ago

casinca commented 4 weeks ago

While toying around to compare with my own code, I noticed two things I'm not sure about in the complete training function from D.4. Let me know what you think:

1. You showed two ways of passing the `peak_lr` value to the optimizer: either passing it directly as an argument to the `train_model` function, or retrieving it from the optimizer's parameters inside `train_model` with `peak_lr = optimizer.param_groups[0]["lr"]`, which is the way implemented in the notebook and the book.

But in the code, the `lr` argument for `optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.1)` is never passed as `lr=peak_lr`, so it falls back to AdamW's default of 1e-3 instead of the intended `peak_lr = 5e-4` when we retrieve it with `peak_lr = optimizer.param_groups[0]["lr"]` (see the sketch below).
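To make the mismatch concrete, here is a minimal sketch (the `Linear` layer is just a stand-in, not the book's GPT model):

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in model, not the book's GPTModel

# Optimizer created without an explicit lr, as in the notebook
optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.1)

# Retrieving the "peak" lr inside train_model then yields AdamW's default
peak_lr = optimizer.param_groups[0]["lr"]
print(peak_lr)  # 0.001, i.e. 1e-3, not the intended 5e-4
```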

2. There is a gap in the gradient clipping: there is no clipping for the first step after the warmup ends, when `global_step == warmup_steps`, because the warmup stops at `if global_step < warmup_steps:` while the clipping only starts at `if global_step > warmup_steps:` (a small sketch below the example output illustrates this).

I'm not sure it was intended to have no clipping at the step where the lr is at its maximum, because you also mentioned:

> Applies gradient clipping after the warmup phase to avoid exploding gradients

Example of an output with `warmup_steps = 18` that prints yes/no under the above conditions:
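Roughly, that printout can be reproduced with a loop like this (simplified sketch with a smaller `warmup_steps`, not the notebook's exact code):

```python
warmup_steps = 3  # 18 in the original example, shortened here for brevity

for global_step in range(6):
    in_warmup = global_step < warmup_steps  # condition used for the lr warmup
    clipped = global_step > warmup_steps    # condition used for gradient clipping
    print(f"step {global_step}: warmup={'yes' if in_warmup else 'no'}, "
          f"clipping={'yes' if clipped else 'no'}")

# At step 3 (== warmup_steps) both conditions are False:
# the first step at the peak lr is neither warmed up nor clipped.
```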


rasbt commented 4 weeks ago

Thanks a lot for the PR. Yes, you are right: since we retrieve the peak LR from the optimizer in the training function, we should initialize the optimizer with the peak LR.

I agree with your update, but I'll set `peak_lr` to 0.001 so that the loss and plots afterwards don't change too much (to make it a bit less confusing for readers).
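Presumably the resulting change looks something like this (a sketch only; the `Linear` layer stands in for the notebook's GPT model):

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the notebook's GPT model

peak_lr = 0.001  # matches AdamW's previous implicit default, so the later plots barely change
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)

# Retrieving the value inside train_model now returns the intended peak lr
assert optimizer.param_groups[0]["lr"] == peak_lr
```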

Regarding the second point, that's a great catch as well. I just double-checked and you are correct!
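One way to close that gap, assuming the condition is simply widened from `>` to `>=` (a sketch of the idea, not a quote of the final notebook change):

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the notebook's GPT model
warmup_steps = 18

for global_step in range(warmup_steps + 2):
    model.zero_grad()
    loss = model(torch.randn(2, 4)).sum()
    loss.backward()
    # ">=" instead of ">" also clips the boundary step (global_step == warmup_steps),
    # so gradients are clipped on every step once the warmup has ended
    if global_step >= warmup_steps:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```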