jiahe7ay / MINI_LLM

This is a personal repository for experimenting with and reproducing the pre-training process of an LLM.

Pre-training OOM on a 24 GB 4090 #21

Closed · Galloroc1 closed this 5 months ago

Galloroc1 commented 5 months ago

The GPU is a 24 GB 4090. I have already shrunk both the model size and the batch size, but training OOMs after only three or four hundred samples. Any idea what is going wrong?

torch version: 2.2.2
transformers version: 4.39.3

The modified model structure is:

```
QWenLMHeadModel(
  (transformer): QWenModel(
    (wte): Embedding(151936, 128)
    (drop): Dropout(p=0.0, inplace=False)
    (rotary_emb): RotaryEmbedding()
    (h): ModuleList(
      (0-7): 8 x QWenBlock(
        (ln_1): RMSNorm()
        (attn): QWenAttention(
          (c_attn): Linear(in_features=128, out_features=384, bias=True)
          (c_proj): Linear(in_features=128, out_features=128, bias=False)
          (attn_dropout): Dropout(p=0.0, inplace=False)
        )
        (ln_2): RMSNorm()
        (mlp): QWenMLP(
          (w1): Linear(in_features=128, out_features=1024, bias=False)
          (w2): Linear(in_features=128, out_features=1024, bias=False)
          (c_proj): Linear(in_features=1024, out_features=128, bias=False)
        )
      )
    )
    (ln_f): RMSNorm()
  )
  (lm_head): Linear(in_features=128, out_features=151936, bias=False)
)
```

QWen size: 42.6M parameters
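As a sanity check on that figure, here is a back-of-the-envelope breakdown (my own arithmetic, derived only from the module shapes printed above; the same total can be confirmed with `sum(p.numel() for p in model.parameters())`):

```python
# Rough parameter breakdown from the printed shapes above (not code from the repo).
vocab, hidden, inter, layers = 151936, 128, 1024, 8

wte = vocab * hidden                            # input embedding: 19,447,808
lm_head = hidden * vocab                        # output projection: 19,447,808
per_block = (hidden * 3 * hidden + 3 * hidden   # c_attn weight + bias
             + hidden * hidden                  # attention c_proj
             + 3 * hidden * inter               # w1, w2, mlp c_proj
             + 2 * hidden)                      # two RMSNorm weights
total = wte + lm_head + layers * per_block + hidden  # + final ln_f
print(f"{total:,}")  # 42,570,880, i.e. ~42.6M
```

The breakdown shows that roughly 91% of the parameters sit in the two 151936-wide vocabulary projections rather than in the transformer blocks, and the logits tensor of shape (batch, seq_len, 151936) is sized accordingly.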

The training arguments are:

```python
args = TrainingArguments(
    output_dir=pretrain_args.model_save_dir,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
    ddp_find_unused_parameters=False,
    gradient_checkpointing=True,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1000,
    learning_rate=5e-4,
    evaluation_strategy='steps',
    eval_steps=1000,
    save_steps=1000,
    save_strategy='steps',
    save_total_limit=3,
    report_to='tensorboard',
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    bf16=True,
    logging_steps=5,
    log_level='info',
    logging_first_step=True,
    eval_accumulation_steps=1,
    group_by_length=True,
    # deepspeed='./ds_config_one_gpu.json',
)
```
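One diagnostic that could be dropped into this setup is a callback that prints GPU memory at every logging step, to show whether usage grows steadily (for example if `group_by_length=True` batches much longer sequences together later in training) or spikes at a single step. This is only a sketch of my own, not code from the repository:

```python
# Diagnostic sketch (not from the repository): log CUDA memory at each logging
# step so the growth pattern before the OOM becomes visible.
import torch
from transformers import TrainerCallback

class CudaMemoryCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if torch.cuda.is_available():
            alloc = torch.cuda.memory_allocated() // 2**20
            peak = torch.cuda.max_memory_allocated() // 2**20
            print(f"step {state.global_step}: allocated {alloc} MiB, peak {peak} MiB")

# Registered like: Trainer(model=model, args=args, ..., callbacks=[CudaMemoryCallback()])
```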

jiahe7ay commented 5 months ago

Could you try downgrading the torch and transformers packages? I haven't run into this kind of situation before.

Galloroc1 commented 5 months ago

> Could you try downgrading the torch and transformers packages? I haven't run into this kind of situation before.

Downgrading didn't help. I'm not sure whether the problem is in the transformers Trainer. A model with roughly the same parameter count trains fine in the baby-llama project.
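One way to test that suspicion is to train the same model with a bare PyTorch loop and see whether memory still climbs. The sketch below is mine, not from either project; `tokenized_dataset` and `collator` are placeholder names for whatever the repo's data pipeline produces.

```python
# Minimal plain-PyTorch loop for isolating the model from the HF Trainer
# (a sketch under assumptions; `tokenized_dataset` and `collator` are placeholders).
import torch
from torch.utils.data import DataLoader

loader = DataLoader(tokenized_dataset, batch_size=1, shuffle=True, collate_fn=collator)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
model.cuda().train()

for step, batch in enumerate(loader):
    batch = {k: v.cuda() for k, v in batch.items()}
    loss = model(**batch).loss  # HF causal-LM models return a loss when labels are present
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    if step % 100 == 0:
        peak = torch.cuda.max_memory_allocated() // 2**20
        print(f"step {step}: loss {loss.item():.3f}, peak {peak} MiB")
```

If memory stays flat here but keeps climbing under the Trainer, the Trainer-side configuration is the likelier culprit; if both climb, the model or data pipeline is.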