banyan-god opened 5 months ago
That is a crazy high learning rate, could be the issue, also check your data, and check val loss for overfitting
Is it though? That was the value in train.py. Either way, I tried a few runs but no luck.
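For context, the default schedule in train.py does not apply 6e-4 flat: it warms up to that peak and then cosine-decays down to min_lr. A minimal sketch of that schedule shape, using the values that appear in the configs later in this thread (illustrative, not the repo's exact code):

```python
import math

def get_lr(it, learning_rate=6e-4, min_lr=6e-5,
           warmup_iters=2000, lr_decay_iters=600000):
    """Warmup + cosine decay, using the hyperparameters from this thread."""
    if it < warmup_iters:                      # linear warmup to the peak
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:                    # after decay: floor at min_lr
        return min_lr
    # cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

print(get_lr(0))        # 0.0: the first iterations see a much lower LR
print(get_lr(2000))     # peak 6e-4
print(get_lr(600000))   # floor 6e-5
```

So the 6e-4 only applies at the end of warmup; if the loss explodes well after warmup, the schedule alone probably isn't the culprit.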
I have the same issue with a single GPU 4060Ti (16G).
@banyan-god, did you try to match the total batch size of ~0.5M tokens? `batch_size * num_of_gpus * gradaccum > 500` (sequences of 1024 tokens).
Your current total batch size is 40% of the original total batch size. It might hurt you, given the stochastic nature of the training.
PS: my single GPU training is too slow for that. :(
The expected losses from the README.md:
| model | params | train loss | val loss |
| ------| ------ | ---------- | -------- |
| gpt2 | 124M | 3.11 | 3.12 |
| gpt2-medium | 350M | 2.85 | 2.84 |
| gpt2-large | 774M | 2.66 | 2.67 |
| gpt2-xl | 1558M | 2.56 | 2.54 |
Yes I did; I used 2 for the GPU count instead of 8:
# these make the total batch size be ~0.5M
# 12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
batch_size = 16
block_size = 1024
gradient_accumulation_steps = 5 * 2
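As a quick sanity check (assuming, as in the train_gpt2.py comment above, that gradient_accumulation_steps already folds in the GPU count), the token count per optimizer step works out as:

```python
def tokens_per_iter(batch_size, block_size, grad_accum_steps):
    # total tokens consumed per optimizer step, summed across all GPUs,
    # assuming grad_accum_steps already includes the GPU multiplier
    return batch_size * block_size * grad_accum_steps

# original train_gpt2.py: 12 * 1024 * (5 * 8) = 491,520 ~= 0.5M tokens
print(tokens_per_iter(12, 1024, 5 * 8))   # 491520
# the modified config above: 16 * 1024 * (5 * 2) = 163,840, well under 0.5M
print(tokens_per_iter(16, 1024, 5 * 2))   # 163840
```

So even with batch_size bumped to 16, dropping the accumulation multiplier from 8 to 2 leaves the effective batch at roughly a third of the recommended ~0.5M tokens.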
@yalding
@banyan-god, you need to keep gradient_accumulation_steps at 40 instead of 10 to maintain the "total batch size ~0.5M" cited in the original comment from karpathy.
My GPU was fried after a week of training and I had to stop, but with a total batch size of ~0.5M I was able to break my previous lowest loss.
@bigsnarfdude, your comment is not relevant, as those were the losses for loading GPT-2 from OpenAI, if I read the doc correctly.
I've trained 124M, medium, and large using both the OpenWebText and RedPajama datasets. Your iterations should be around 100k, and you will reach the same training and val loss as the GPT-2 loaded from weights. For example, 124M with grad_acc=5 and batch_size=12 and the standard LR provided in the repo gets you a model pretrained from scratch that is very similar to the posted chart.
@yalding So I started another job today with ~572.06M parameters and grad accumulation of 40, as you suggested. Will report back on progress, or if it explodes.
always_save_checkpoint:true
backend:"nccl"
batch_size:5
beta1:0.9
beta2:0.95
bias:false
block_size:1024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:1000
eval_iters:200
eval_only:false
grad_clip:1
gradient_accumulation_steps:40
init_from:"scratch"
learning_rate:0.0006
log_interval:10
lr_decay_iters:600000
max_iters:600000
min_lr:0.00006
n_embd:1600
n_head:16
n_layer:16
out_dir:"out"
wandb_log:true
wandb_project:"owt"
wandb_run_name:"gpt2"
warmup_iters:2000
weight_decay:0.1
@banyan-god, you are setting batch_size to 5. This again reduces the total batch size to 5 * 50 * 1024 ≈ 0.25M, which is half of the recommended 0.5M total batch size...
@yalding OK, rolled back all the hyperparameter changes and am just running with the following:
torchrun --standalone --nproc_per_node=2 train.py config/train_gpt2.py
always_save_checkpoint:true
backend:"nccl"
batch_size:12
beta1:0.9
beta2:0.95
bias:false
block_size:1024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:1000
eval_iters:200
eval_only:false
grad_clip:1
gradient_accumulation_steps:40
init_from:"scratch"
learning_rate:0.0006
@yalding Unfortunately that didn't work either.
Did you validate the config logged in wandb? My last run config:
always_save_checkpoint:false
backend:"nccl"
batch_size:11
beta1:0.9
beta2:0.95
bias:false
block_size:1024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:100
eval_iters:50
eval_only:false
grad_clip:1
gradient_accumulation_steps:50
init_from:"resume"
learning_rate:0.0006
log_interval:1
lr_decay_iters:600000
max_iters:600000
min_lr:0.00006
n_embd:768
n_head:12
n_layer:12
out_dir:"out"
wandb_log:true
wandb_project:"owt"
wandb_run_name:"gpt2-124M-original-seed3000-bs550-resume"
warmup_iters:2000
weight_decay:0.1
Final train/val loss before it crashed my GPU:
train/loss:3.237154483795166 val/loss:3.248462915420532
@yalding
always_save_checkpoint:true
backend:"nccl"
batch_size:12
beta1:0.9
beta2:0.95
bias:false
block_size:1024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:1000
eval_iters:200
eval_only:false
grad_clip:1
gradient_accumulation_steps:40
init_from:"scratch"
learning_rate:0.0006
log_interval:10
lr_decay_iters:600000
max_iters:600000
min_lr:0.00006
n_embd:768
n_head:12
n_layer:12
out_dir:"out"
wandb_log:true
wandb_project:"sifra"
wandb_run_name:"sifra-124M"
warmup_iters:2000
weight_decay:0.1
I am also wondering if it's possibly something to do with the PyTorch version or OpenWebText.
I'm encountering the same issue. @banyan-god, did you eventually figure out a way to resolve this?
@seanxwzhang I want to say it is a combination of the tokenizer and the dataset. When I switched over to the GPT-4 tokenizer, the problem disappeared.
Interesting; in my case it was fixed by using fp16 instead of bf16. Surprised that the tokenizer can have an effect on what looks like a numerical issue (or perhaps it isn't one).
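If fp16 helped where bf16 didn't, the relevant difference is that fp16's narrow exponent range makes loss scaling necessary (nanoGPT enables a GradScaler only for float16). A minimal sketch of that pattern, with a hypothetical toy model and batch standing in for the real training loop:

```python
import torch
from contextlib import nullcontext

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16  # vs torch.bfloat16, which needs no loss scaling

# hypothetical tiny model/batch, just to illustrate the amp + GradScaler pattern
model = torch.nn.Linear(8, 8).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)

# GradScaler is a no-op when enabled=False (bf16, or no CUDA)
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == torch.float16 and device == "cuda"))
ctx = (torch.amp.autocast(device_type="cuda", dtype=dtype)
       if device == "cuda" else nullcontext())

x = torch.randn(4, 8, device=device)
with ctx:
    loss = model(x).pow(2).mean()

scaler.scale(loss).backward()      # scale the loss so fp16 grads don't underflow
scaler.unscale_(optimizer)         # unscale so clipping sees true gradient norms
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # grad_clip=1 as in the configs
scaler.step(optimizer)             # skips the update if grads hit inf/nan
scaler.update()
optimizer.zero_grad(set_to_none=True)
```

If bf16 diverges on hardware where fp16+scaler converges, it may also be worth ruling out a driver/compile issue, since bf16 needs no scaler at all in principle.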
@seanxwzhang @banyan-god Were you able to converge your training to 2.9 on GPT2 Small? Did the loss go to NaN or explode back up? I am encountering the same issue, and have tried both of your solutions (fp16 and gpt4 tokenizer).
If possible, please let me know what versions of torch you are using.
I was using torch 2.3.0
I am running on 2x 4090, and changed the GPU count in gradient_accumulation_steps from 8 to 2 (i.e. 5 * 2 instead of 5 * 8).
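One thing worth double-checking with that change: if your train.py divides the configured gradient_accumulation_steps by the DDP world size (recent nanoGPT does), then leaving the config at 5 * 8 = 40 on 2 GPUs keeps the ~0.5M tokens per step (each GPU just runs more micro-steps), while changing it to 5 * 2 cuts the total batch to a quarter. A sketch of that bookkeeping (variable names assumed, not copied from the repo):

```python
# how nanoGPT-style training splits gradient accumulation across DDP ranks
gradient_accumulation_steps = 5 * 8   # configured total, written for 8 GPUs
batch_size, block_size = 12, 1024
ddp_world_size = 2                    # e.g. 2x 4090

# each rank accumulates its share of the configured total
assert gradient_accumulation_steps % ddp_world_size == 0
per_rank_accum = gradient_accumulation_steps // ddp_world_size  # 20 micro-steps per GPU

tokens_per_iter = per_rank_accum * ddp_world_size * batch_size * block_size
print(tokens_per_iter)  # 491,520 ~= 0.5M regardless of GPU count
```

In other words, with that division in place, the 5 * 8 is not "8 because you have 8 GPUs"; it is the total accumulation budget, and scaling it down with the GPU count shrinks the effective batch.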