karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.
MIT License

Is this loss curve normal #468

Open banyan-god opened 5 months ago

banyan-god commented 5 months ago

[W&B loss chart, 3/27/2024] I am running on 2x 4090, and changed the GPU factor in gradient_accumulation_steps from 8 to 2.

$ more train_gpt2.py
# config for training GPT-2 (124M) down to very nice loss of ~2.85 on 1 node of 8X A100 40GB
# launch as the following (e.g. in a screen session) and wait ~5 days:
# $ torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py

wandb_log = True
wandb_project = 'owt'
wandb_run_name='gpt2-124M'

# these make the total batch size be ~0.5M
# 12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
batch_size = 20
block_size = 1024
gradient_accumulation_steps = 5 * 2

# this makes total number of tokens be 300B
max_iters = 600000
lr_decay_iters = 600000

# eval stuff
eval_interval = 1000
eval_iters = 200
log_interval = 10

# weight decay
weight_decay = 1e-1
VatsaDev commented 5 months ago

That is a crazy high learning rate; it could be the issue. Also check your data, and check the val loss for overfitting.
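
For reference, 6e-4 is the repo's default peak learning rate and is only reached after warmup; the schedule in train.py is, approximately, a linear warmup followed by a cosine decay down to min_lr. A runnable sketch (an approximate reconstruction, not the exact code):

```python
import math

# values from config/train_gpt2.py / train.py defaults
learning_rate = 6e-4    # peak LR
min_lr = 6e-5           # floor after decay
warmup_iters = 2000
lr_decay_iters = 600000

def get_lr(it):
    # approximate reconstruction of get_lr() in train.py
    if it < warmup_iters:                       # 1) linear warmup
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:                     # 2) past the decay horizon
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)        # 3) cosine decay

print(get_lr(100), get_lr(2000), get_lr(300000))  # ramping up, at peak, mid-decay
```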

banyan-god commented 5 months ago

Is it though? That was the value in train.py. Either way, I tried a few runs but no luck. [W&B loss charts, 4/3/2024]

yalding commented 5 months ago

I have the same issue with a single GPU 4060Ti (16G).

yalding commented 5 months ago

@banyan-god, did you try to match the total batch size of ~0.5M? batch_size * num_of_gpus * gradaccum > 500.

Your current total batch size is 40% of the original total batch size. It might impact you due to the stochastic nature of the training.

PS: my single GPU training is too slow for that. :(
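
A quick check of that arithmetic, using the tokens-per-iteration formula train.py reports at startup (the baseline numbers come from config/train_gpt2.py; the modified ones from the config pasted above):

```python
def tokens_per_iter(batch_size, block_size, grad_accum):
    # grad_accum is the value from the config file, which already includes
    # the GPU factor (train.py splits it across DDP ranks at startup)
    return batch_size * block_size * grad_accum

original = tokens_per_iter(12, 1024, 5 * 8)   # 491,520 ~= 0.5M tokens/iter
modified = tokens_per_iter(20, 1024, 5 * 2)   # 204,800 tokens/iter
print(original, modified, modified / original)  # ratio ~0.42, i.e. the ~40% above
```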

bigsnarfdude commented 5 months ago

The expected losses listed in the README.md:

| model | params | train loss | val loss |
| ------| ------ | ---------- | -------- |
| gpt2 | 124M         | 3.11  | 3.12     |
| gpt2-medium | 350M  | 2.85  | 2.84     |
| gpt2-large | 774M   | 2.66  | 2.67     |
| gpt2-xl | 1558M     | 2.56  | 2.54     |
banyan-god commented 5 months ago

> @banyan-god, did you try to match the total batch size of ~0.5M? batch_size * num_of_gpus * gradaccum > 500.
>
> Your current total batch size is 40% of the original total batch size. It might impact you due to the stochastic nature of the training.
>
> PS: my single GPU training is too slow for that. :(

Yes, I did. I used 2 for the GPU factor instead of 8:

# these make the total batch size be ~0.5M
# 12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
batch_size = 16
block_size = 1024
gradient_accumulation_steps = 5 * 2

@yalding

yalding commented 5 months ago

@banyan-god, you need to keep gradient_accumulation_steps at 40 instead of 10 to maintain the "total batch size ~0.5M" cited in the original comment from karpathy.

My GPU was fried after a week of training, and I had to stop. But with a total batch size of ~0.5M, I was able to beat the lowest loss I had before.

@bigsnarfdude, your comment is not relevant, as those were the losses for GPT-2 loaded from the OpenAI weights, if I read the doc correctly.
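
For context on why the config keeps gradient_accumulation_steps at 5 * 8 = 40 regardless of GPU count: train.py divides the configured value by the DDP world size, so the total tokens per optimizer step stay fixed. A rough sketch of that bookkeeping (the helper name is made up; the split-by-world-size behavior mirrors train.py):

```python
def plan_ddp(grad_accum_cfg, ddp_world_size, batch_size=12, block_size=1024):
    # the configured gradient_accumulation_steps is split across DDP ranks,
    # so the total tokens per iteration do not depend on the GPU count
    assert grad_accum_cfg % ddp_world_size == 0
    micro_steps_per_gpu = grad_accum_cfg // ddp_world_size
    tokens_per_iter = micro_steps_per_gpu * ddp_world_size * batch_size * block_size
    return micro_steps_per_gpu, tokens_per_iter

print(plan_ddp(40, 8))                 # (5, 491520)  -- the 8x A100 default
print(plan_ddp(40, 2))                 # (20, 491520) -- 2x 4090, same total batch
print(plan_ddp(10, 2, batch_size=16))  # (5, 163840)  -- the 5 * 2 config above, ~0.16M
```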

bigsnarfdude commented 5 months ago

I've trained 124M, medium, and large using both the OpenWebText and RedPajama datasets. Your iterations should be around 100k, and you will reach the same training and val loss as the GPT-2 loaded from weights. For example, for 124M with grad_acc=5, batch_size=12, and the standard LR provided in the repo, you get a model pretrained from scratch that is very similar to the posted chart.
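
As rough arithmetic on the "~100k iterations" point (the ~9B-token size of the OpenWebText train split is the figure quoted in the repo's data prep notes):

```python
tokens_per_iter = 12 * 1024 * 5 * 8   # ~0.5M tokens, the default GPT-2 config
owt_train_tokens = 9.0e9              # approximate size of openwebtext train.bin

for iters in (100_000, 600_000):
    tokens_seen = iters * tokens_per_iter
    print(f"{iters} iters -> {tokens_seen / 1e9:.0f}B tokens "
          f"(~{tokens_seen / owt_train_tokens:.1f} epochs of OWT)")
# 100k iters ~ 49B tokens (~5.5 epochs); 600k iters ~ 295B tokens (~33 epochs)
```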

banyan-god commented 5 months ago

@yalding So I started another job today with ~572.06M parameters and gradient accumulation of 40, as you suggested. I will report back on progress, or if it explodes.

always_save_checkpoint:true
backend:"nccl"
batch_size:5
beta1:0.9
beta2:0.95
bias:false
block_size:1,024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:1,000
eval_iters:200
eval_only:false
grad_clip:1
gradient_accumulation_steps:40
init_from:"scratch"
learning_rate:0.0006
log_interval:10
lr_decay_iters:600,000
max_iters:600,000
min_lr:0.00006
n_embd:1,600
n_head:16
n_layer:16
out_dir:"out"
wandb_log:true
wandb_project:"owt"
wandb_run_name:"gpt2"
warmup_iters:2,000
weight_decay:0.1
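
For what it's worth, ~572.06M is consistent with that config. A quick parameter count for the nanoGPT architecture (weight-tied lm_head, bias=False, and the reported number excludes position embeddings, as model.get_num_params() does; the helper below is just a sketch):

```python
def gpt_params(n_layer, n_embd, vocab_size=50304, block_size=1024):
    # per block: attention (qkv + proj) + MLP (4x expansion) + two LayerNorm weights
    per_block = (3 * n_embd * n_embd + n_embd * n_embd        # c_attn, c_proj
                 + 4 * n_embd * n_embd + 4 * n_embd * n_embd  # mlp c_fc, c_proj
                 + 2 * n_embd)                                # ln_1, ln_2 (no bias)
    wte = vocab_size * n_embd      # token embeddings (tied with lm_head)
    wpe = block_size * n_embd      # position embeddings
    total = wte + wpe + n_layer * per_block + n_embd  # + final LayerNorm
    return total - wpe             # nanoGPT reports params excluding wpe

print(gpt_params(16, 1600) / 1e6)  # -> ~572.06, matching the run above
print(gpt_params(12, 768) / 1e6)   # -> ~123.6, the "124M" GPT-2 small config
```
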
yalding commented 5 months ago

@banyan-god you are setting batch_size to 5. This will again reduce the total batch size, to 5 * 50 * 1024 ≈ 0.25M, which is half of the recommended 0.5M total batch size...

banyan-god commented 5 months ago

@yalding OK, I rolled back all the hyperparameter changes and am just running with the following: torchrun --standalone --nproc_per_node=2 train.py config/train_gpt2.py

always_save_checkpoint:true
backend:"nccl"
batch_size:12
beta1:0.9
beta2:0.95
bias:false
block_size:1,024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:1,000
eval_iters:200
eval_only:false
grad_clip:1
gradient_accumulation_steps:40
init_from:"scratch"
learning_rate:0.0006
banyan-god commented 5 months ago

@yalding unfortunately that didn't work either.

[Screenshot of the loss curve, 2024-04-17]
yalding commented 5 months ago

Did you validate the config logged in wandb? My last run config:

always_save_checkpoint:false
backend:"nccl"
batch_size:11
beta1:0.9
beta2:0.95
bias:false
block_size:1,024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:100
eval_iters:50
eval_only:false
grad_clip:1
gradient_accumulation_steps:50
init_from:"resume"
learning_rate:0.0006
log_interval:1
lr_decay_iters:600,000
max_iters:600,000
min_lr:0.00006
n_embd:768
n_head:12
n_layer:12
out_dir:"out"
wandb_log:true
wandb_project:"owt"
wandb_run_name:"gpt2-124M-original-seed3000-bs550-resume"
warmup_iters:2,000
weight_decay:0.1

final train/val loss until it crashed my GPU:

train/loss:3.237154483795166 val/loss:3.248462915420532

banyan-god commented 5 months ago

@yalding

always_save_checkpoint:true
backend:"nccl"
batch_size:12
beta1:0.9
beta2:0.95
bias:false
block_size:1,024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:1,000
eval_iters:200
eval_only:false
grad_clip:1
gradient_accumulation_steps:40
init_from:"scratch"
learning_rate:0.0006
log_interval:10
lr_decay_iters:600,000
max_iters:600,000
min_lr:0.00006
n_embd:768
n_head:12
n_layer:12
out_dir:"out"
wandb_log:true
wandb_project:"sifra"
wandb_run_name:"sifra-124M"
warmup_iters:2,000
weight_decay:0.1
banyan-god commented 5 months ago

I am also wondering if it is possibly something to do with the PyTorch version or OpenWebText.

seanxwzhang commented 1 month ago

I'm encountering the same issue, @banyan-god did you eventually figure out a way to resolve this?

banyan-god commented 1 month ago

@seanxwzhang I want to say it was a combination of the tokenizer and the dataset. When I switched over to the GPT-4 tokenizer, the problem disappeared.
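
For anyone trying the same switch: changing the tokenizer means regenerating the .bin files and bumping vocab_size in the model config. A rough sketch of what changes in data/openwebtext/prepare.py, assuming tiktoken's cl100k_base (the GPT-4 encoding); note its token ids no longer fit in uint16:

```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # prepare.py uses "gpt2" by default

def process(text):
    ids = enc.encode_ordinary(text)          # encode without special tokens
    ids.append(enc.eot_token)                # delimit documents with <|endoftext|>
    return ids

ids = process("hello world")
# cl100k_base has ~100k tokens (enc.n_vocab ~= 100277), so uint16 (max 65535)
# would silently overflow; the .bin dtype and the model's vocab_size must change
arr = np.array(ids, dtype=np.uint32)         # the original script writes uint16
print(enc.n_vocab, arr.dtype)
```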

seanxwzhang commented 1 month ago

Interesting; in my case it was fixed by using fp16 instead of bf16. I'm surprised that the tokenizer can have an effect on what looks like a numerical issue (or perhaps it isn't one).
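
For anyone trying the fp16 route: in train.py the precision is just the dtype config string, and float16 additionally turns on gradient scaling. A minimal sketch of that pattern (approximate; the exact wiring in train.py may differ slightly):

```python
import torch

dtype = 'float16'  # instead of 'bfloat16'; fp16 needs loss scaling, bf16 does not
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]

ctx = torch.amp.autocast(device_type='cuda', dtype=ptdtype)
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

# inside the training loop (sketch):
# with ctx:
#     logits, loss = model(X, Y)
# scaler.scale(loss).backward()
# scaler.unscale_(optimizer)
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# scaler.step(optimizer)
# scaler.update()
# optimizer.zero_grad(set_to_none=True)
```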

mattgorb commented 1 week ago

@seanxwzhang @banyan-god Were you able to get your training to converge to 2.9 on GPT-2 small? Did the loss go to NaN or explode back up? I am encountering the same issue and have tried both of your solutions (fp16 and the GPT-4 tokenizer).

If possible, please let me know what versions of torch you are using.
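
A generic way to tell a sudden NaN blow-up apart from a slow divergence is to check that the loss and gradient norm stay finite each iteration; this is not part of the repo, just a sketch one could bolt onto the training loop:

```python
import torch

def finite_step(loss, model, iter_num, grad_clip=1.0):
    # generic guard, not from train.py: flag non-finite loss or gradients
    if not torch.isfinite(loss):
        print(f"iter {iter_num}: non-finite loss {loss.item()}")
        return False
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    if not torch.isfinite(grad_norm):
        print(f"iter {iter_num}: non-finite grad norm")
        return False
    return True
```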

seanxwzhang commented 1 week ago

> @seanxwzhang @banyan-god Were you able to get your training to converge to 2.9 on GPT-2 small? Did the loss go to NaN or explode back up? I am encountering the same issue and have tried both of your solutions (fp16 and the GPT-4 tokenizer).
>
> If possible, please let me know what versions of torch you are using.

I was using torch 2.3.0