hiyouga / LLaMA-Factory

Unify Efficient Fine-Tuning of 100+ LLMs

8*A800 80G LoRA training of Qwen2-72B: abnormal memory usage #4453

Closed 999wwx closed 2 days ago

999wwx commented 4 days ago

Reminder

System Info

Reproduction

### model
model_name_or_path: /shard/Qwen2-72B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_offload_config.json

### dataset
dataset: lima,self_cognition_replace
template: qwen
cutoff_len: 8192
max_samples: 1000000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/Qwen2-72B-Instruct/lora/sft_lima
logging_steps: 4
save_steps: 16
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: false

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 4
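For a rough sense of what this config implies per optimizer step, here is a small back-of-the-envelope sketch (assuming the 8 GPUs from the issue title and the ~1000-sample dataset mentioned below; the numbers are illustrative, not taken from any log):

```python
# Hypothetical arithmetic based on the config above; not output from the run.
num_gpus = 8                        # 8*A800 per the issue title
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_train_epochs = 3.0
val_size = 0.1
dataset_size = 1000                 # "1000+" samples per the description below

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
train_samples = int(dataset_size * (1 - val_size))
steps_per_epoch = max(1, train_samples // effective_batch)
total_steps = int(steps_per_epoch * num_train_epochs)

print(effective_batch)   # 64 samples per optimizer step
print(steps_per_epoch)   # ~14 steps per epoch
print(total_steps)       # ~42 steps total, so save_steps: 16 yields only a few checkpoints
```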

Memory usage while saving the model: (screenshot of per-process memory usage)
Why does the main process use this much memory? The container is configured with 720 GB of RAM, and while the final checkpoint was being saved, memory overflowed and the container restarted.
PS: the dataset contains 1000+ samples, the longest being 3000+ tokens.
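For scale, here is a minimal sketch of the host-memory numbers involved when ZeRO-3 offloads or gathers a 72B model. This is illustrative arithmetic under assumed settings only, not a diagnosis of what the trainer actually does at save time:

```python
# Illustrative only: assumes bf16 weights, 8 ranks, and that a save path might
# gather the full model onto rank 0. Whether such a gather happens in this run
# depends on the DeepSpeed / LLaMA-Factory save logic actually used.
params = 72e9            # approximate parameter count of Qwen2-72B
bytes_per_param = 2      # bf16
num_ranks = 8

per_rank_offload_gb = params * bytes_per_param / num_ranks / 2**30
gathered_model_gb = params * bytes_per_param / 2**30

print(f"~{per_rank_offload_gb:.0f} GB of bf16 params offloaded to CPU per rank")      # ~17 GB
print(f"~{gathered_model_gb:.0f} GB if the full bf16 model is gathered on one rank")  # ~134 GB
```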

Expected behavior

No response

Others

No response

yaya159456 commented 4 days ago

I have a similar problem: when LoRA fine-tuning Qwen, the dataset is not large, but the GPU keeps running out of memory (OOM) and training fails.

yaya159456 commented 4 days ago

per_device_train_batch_size is already 1, so why does it still OOM? I have 24 GB of memory and I am fine-tuning qwen2-7B-instruct. How does that run out of memory?
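For context, the bf16 base weights of a ~7B model already take most of a 24 GB card, so activations at long sequence lengths can push past the limit even at batch size 1. A minimal sketch of that arithmetic (the parameter count is approximate and activation cost is not modeled):

```python
# Rough footprint of the frozen bf16 base weights alone; LoRA adapters and their
# optimizer states are small, but activations/KV cache grow with sequence length.
params = 7.6e9           # approximate parameter count of Qwen2-7B-Instruct
bytes_per_param = 2      # bf16
weights_gb = params * bytes_per_param / 2**30
print(f"~{weights_gb:.0f} GB of the 24 GB card used by base weights alone")  # ~14 GB
```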