karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.
MIT License

Train/Val Loss Issues when training GPT-2 from OWT #303

Open JustinKunzi opened 1 year ago

JustinKunzi commented 1 year ago

I'm currently testing a new system of 4x NVIDIA A4500s and had no problem running the Shakespeare training set; the outcome was as intended. The system is now training the GPT-2 124M model on the OWT dataset, and everything seemed perfectly fine until about 20 steps in, when our loss value increased to ~7.81 and has been hovering within ~0.5 of that value for around 65 steps now. Compared to the loss graph on the README this seems highly unusual and could indicate some kind of problem. Any ideas what could be happening? Here are the config changes that were made to make the most of VRAM and reduce total training time.

batch_size = 16
block_size = 1024
gradient_accumulation_steps = 5 * 4

The gradient accumulation steps were multiplied by 4 since our node only has 4 GPUs, and the batch size was increased to 16 because the original config value of 12 was using only about 60% of GPU memory; increasing it to 16 brought that to around 94% and sped up iteration. This is the ONLY code that was changed before starting training. I've also included the loss graphs from wandb to help visualize the problem. [W&B loss charts, 6/17/2023]
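
For context on what this change does to the effective batch, here is a rough back-of-the-envelope comparison, assuming train.py counts tokens per iteration as gradient_accumulation_steps * batch_size * block_size (the configured grad-accum value is divided across the GPUs and multiplied back by the world size, so the GPU count cancels out):

```python
# Rough sketch: effective batch size (in tokens) per optimizer step, assuming
# tokens_per_iter = gradient_accumulation_steps * batch_size * block_size.
block_size = 1024

# stock config/train_gpt2.py: batch_size = 12, gradient_accumulation_steps = 5 * 8
default_tokens = 12 * (5 * 8) * block_size    # 491,520 tokens per iteration

# the modified run above: batch_size = 16, gradient_accumulation_steps = 5 * 4
modified_tokens = 16 * (5 * 4) * block_size   # 327,680 tokens per iteration

print(f"default : {default_tokens:,} tokens/iter")
print(f"modified: {modified_tokens:,} tokens/iter")
```

If that arithmetic is right, the modified run steps the optimizer on roughly a third fewer tokens than the default while keeping the same learning-rate schedule, which could plausibly affect stability.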

If anyone could explain what could possibly be going on that would be much appreciated. Thank you in advance.

JustinKunzi commented 1 year ago

Update: It's day 5 of 11 of training GPT-2 on OWT, and our loss values have still not moved much from the original jump. The training and validation losses are still almost identical, and I'm still not entirely sure what is going on. I'm certain it's not overfitting, as the jump occurred so early in training and the validation and training losses stay so close, never separating more than ~0.2 from each other. I've again included both updated loss graphs to help visualize what is going on.

[W&B loss charts, 6/20/2023]

echosprint commented 1 year ago

I encountered the same issue: the train loss suddenly jumps from ~3.0 to ~7.0 and then stays around 7.0 for a long time. Did you figure out the issue?

JustinKunzi commented 1 year ago

> I encountered the same issue: the train loss suddenly jumps from ~3.0 to ~7.0 and then stays around 7.0 for a long time. Did you figure out the issue?

Unfortunately not. The training finished having stayed within 0.5 of 7.0 nearly the entire time. It's odd that you hit the exact same loss value; perhaps it's something with the dataset, then? I see a lot of people on here doing the simple Shakespeare run or having other issues, and I haven't yet read of a successful attempt at training on the full OWT dataset. I'm thinking about training on a new dataset in the near future when I get some free time; hopefully the issue doesn't persist then.

iainmelvin commented 1 year ago

This happened to me too, with one A100 on openwebtext with default parameters. If anyone knows what might be going on, I am all ears! (I also suspect a data issue.)

jiacheng-ye commented 1 year ago

Same issue for me. I'm using 4x A100 80G on openwebtext; I changed batch_size from 12 to 24 and gradient_accumulation_steps from 5 * 8 to 5 * 4.

[loss chart]

gkucsko commented 1 year ago

Is this bfloat16? Do you see the same with float32?
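
For anyone who wants to run that comparison: the autocast dtype is a plain config variable in train.py, so (if I remember the configurator correctly) it can be pinned either with --dtype=float32 on the command line or in a small override file, something like this hypothetical config/train_gpt2_fp32.py:

```python
# Hypothetical override file, e.g. config/train_gpt2_fp32.py, assuming train.py
# exposes a `dtype` config variable taking 'float32', 'bfloat16', or 'float16';
# intended to be passed after the usual config/train_gpt2.py so only these change.
wandb_run_name = 'gpt2-124M-fp32'
dtype = 'float32'   # full precision, to rule out bf16/fp16 numerics
batch_size = 8      # fp32 roughly doubles activation memory, so the micro-batch
                    # may need to shrink to fit
```

A run that is stable in float32 but not in bf16/fp16 would point at precision rather than the data.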

jiacheng-ye commented 1 year ago

I use fp16; disabling flash attention works for me.
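
For anyone else who wants to test that: if I'm reading model.py right, flash attention is selected by a flash flag on CausalSelfAttention, so it can be forced off without touching the training loop; a minimal sketch, assuming the attribute and module layout of the stock model.py:

```python
# Minimal sketch: force the non-flash (manual) attention path, assuming
# CausalSelfAttention stores its choice in a `flash` attribute and the blocks
# live under model.transformer.h, as in the stock nanoGPT model.py.
from model import GPT, GPTConfig

model = GPT(GPTConfig())           # default 124M-sized config, for illustration
for block in model.transformer.h:  # walk the transformer blocks
    block.attn.flash = False       # fall back to the manual attention path
```

The manual path materializes the full (T, T) attention matrix, so memory use goes up noticeably at block_size = 1024; an equivalent alternative is to set the flag to False directly in model.py.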

ziqi-zhang commented 1 year ago

I'm training GPT-2 (124M) on OpenWebText and I encountered the same problem. Did you figure it out?

karan78tp commented 1 year ago

Hi, I am using around 42GB of GPU memory. Dataset = OpenWebText (20GB), batch size = 16, and I am on the 8th day of training now. [training screenshot] Val loss is at 2.95.

ziqi-zhang commented 1 year ago

@karan78tp Can you share the hyperparameter settings in config/train_gpt2.py? Thanks!

karan78tp commented 1 year ago

@ziqi-zhang batch size is 20.

wandb_run_name = 'gpt2-124M'
batch_size = 20
block_size = 1024
gradient_accumulation_steps = 5 * 8
max_iters = 600000
lr_decay_iters = 600000
eval_interval = 1000
eval_iters = 200
log_interval = 10
weight_decay = 1e-1

ziqi-zhang commented 1 year ago

@karan78tp Thanks! I was also wondering how many GPUs you use. Do you use only one GPU?

karan78tp commented 1 year ago

I am using two Quadro RTX 6000 GPUs, each with 24576 MiB.

ziqi-zhang commented 1 year ago

Thanks!

Tony-Hou commented 1 year ago

It is gradient_accumulation_steps that causes the issue; keep gradient_accumulation_steps = 8 and training is OK!

karan78tp commented 1 year ago

Converged configurations:
- batch size = 20, gradient steps = 40
- original config: batch size = 12, gradient steps = 40

Earlier configurations which failed to converge:
i) batch size = 16, gradient steps = 20
ii) batch size = 24, gradient steps = 40

I guess maybe the ratio of gradient steps to batch size should be higher.
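
To put numbers on that guess, here is a quick sketch of the gradient-steps-to-batch-size ratio, together with the tokens per optimizer step, for the configurations above (block_size = 1024 is assumed, and the tokens-per-iter formula mirrors how train.py scales the configured grad-accum value across GPUs):

```python
# Grad-accum-to-batch-size ratio and tokens per optimizer step for the
# configurations listed above; block_size = 1024 and the formula
# tokens_per_iter = gradient_steps * batch_size * block_size are assumptions.
block_size = 1024
configs = [
    ("converged: batch 20, grad 40",            20, 40),
    ("converged: batch 12, grad 40 (original)", 12, 40),
    ("failed:    batch 16, grad 20",            16, 20),
    ("failed:    batch 24, grad 40",            24, 40),
]
for name, bs, gas in configs:
    print(f"{name}  ratio={gas / bs:.2f}  tokens/iter={gas * bs * block_size:,}")
```

On those numbers the converged runs do have the higher ratios (2.00 and 3.33 versus 1.25 and 1.67), even though the failed batch-24 run sees the most tokens per step, so if the pattern holds it is the ratio rather than the raw effective batch that matters here.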

TahaBinhuraib commented 1 year ago

@karan78tp this is with 2 GPUs, right?

zzw-zwzhang commented 8 months ago

> Hi, I am using around 42GB of GPU memory. Dataset = OpenWebText (20GB), batch size = 16, and I am on the 8th day of training now. [training screenshot] Val loss is at 2.95.

Can you share your wandb log?

iminfine commented 2 months ago

Disabling flash attention helps convergence.