hiyouga / LLaMA-Factory

Unify Efficient Fine-Tuning of 100+ LLMs

8*A800 80G LoRA training of Qwen2-72B: abnormal memory usage #4453

Closed 999wwx closed 2 days ago

999wwx commented 4 days ago

Reminder

System Info

Reproduction

### model
model_name_or_path: /shard/Qwen2-72B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_offload_config.json

### dataset
dataset: lima,self_cognition_replace
template: qwen
cutoff_len: 8192
max_samples: 1000000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/Qwen2-72B-Instruct/lora/sft_lima
logging_steps: 4
save_steps: 16
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: false

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 4
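For a rough sense of what this config implies per optimizer step, here is a small back-of-the-envelope sketch (assuming the 8 GPUs from the issue title and the ~1000-sample dataset mentioned below; the numbers are illustrative, not taken from any log):

```python
# Hypothetical arithmetic based on the config above; not output from the run.
num_gpus = 8                        # 8*A800 per the issue title
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_train_epochs = 3.0
val_size = 0.1
dataset_size = 1000                 # "1000+" samples per the description below

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
train_samples = int(dataset_size * (1 - val_size))
steps_per_epoch = max(1, train_samples // effective_batch)
total_steps = int(steps_per_epoch * num_train_epochs)

print(effective_batch)   # 64 samples per optimizer step
print(steps_per_epoch)   # ~14 steps per epoch
print(total_steps)       # ~42 steps total, so save_steps: 16 yields only a few checkpoints
```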

Memory usage while saving the model: (screenshot of per-process memory usage)
Why does the main process use this much memory? The container is configured with 720 GB of RAM, and while the final checkpoint was being saved, memory overflowed and the container restarted.
PS: the dataset contains 1000+ samples, the longest being 3000+ tokens.
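For scale, here is a minimal sketch of the host-memory numbers involved when ZeRO-3 offloads or gathers a 72B model. This is illustrative arithmetic under assumed settings only, not a diagnosis of what the trainer actually does at save time:

```python
# Illustrative only: assumes bf16 weights, 8 ranks, and that a save path might
# gather the full model onto rank 0. Whether such a gather happens in this run
# depends on the DeepSpeed / LLaMA-Factory save logic actually used.
params = 72e9            # approximate parameter count of Qwen2-72B
bytes_per_param = 2      # bf16
num_ranks = 8

per_rank_offload_gb = params * bytes_per_param / num_ranks / 2**30
gathered_model_gb = params * bytes_per_param / 2**30

print(f"~{per_rank_offload_gb:.0f} GB of bf16 params offloaded to CPU per rank")      # ~17 GB
print(f"~{gathered_model_gb:.0f} GB if the full bf16 model is gathered on one rank")  # ~134 GB
```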

Expected behavior

No response

Others

No response

yaya159456 commented 4 days ago

I have a similar problem: when LoRA fine-tuning Qwen, the dataset is not large, but the GPU keeps running out of memory (OOM) and training fails.

yaya159456 commented 4 days ago

per_device_train_batch_size is already 1, so why does it still OOM? I have 24 GB of memory and I am fine-tuning qwen2-7B-instruct. How does that run out of memory?
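For context, the bf16 base weights of a ~7B model already take most of a 24 GB card, so activations at long sequence lengths can push past the limit even at batch size 1. A minimal sketch of that arithmetic (the parameter count is approximate and activation cost is not modeled):

```python
# Rough footprint of the frozen bf16 base weights alone; LoRA adapters and their
# optimizer states are small, but activations/KV cache grow with sequence length.
params = 7.6e9           # approximate parameter count of Qwen2-7B-Instruct
bytes_per_param = 2      # bf16
weights_gb = params * bytes_per_param / 2**30
print(f"~{weights_gb:.0f} GB of the 24 GB card used by base weights alone")  # ~14 GB
```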