Closed · 999wwx closed this issue 2 days ago
Reminder
System Info
llamafactory version: 0.8.0
Platform: Linux-4.18.0-193.14.2.el8_2.x86_64-x86_64-with-glibc2.31
Python version: 3.10.13
PyTorch version: 2.1.2+cu121 (GPU)
Transformers version: 4.41.2
Datasets version: 2.16.0
Accelerate version: 0.30.1
PEFT version: 0.11.1
TRL version: 0.8.6
GPU type: NVIDIA A800 80GB PCIe
DeepSpeed version: 0.14.0
Reproduction
### model
model_name_or_path: /shard/Qwen2-72B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_offload_config.json

### dataset
dataset: lima,self_cognition_replace
template: qwen
cutoff_len: 8192
max_samples: 1000000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/Qwen2-72B-Instruct/lora/sft_lima
logging_steps: 4
save_steps: 16
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: false

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 4
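For context, here is one way to check what the referenced DeepSpeed config actually enables. The keys inspected below are the ones that drive host-memory behavior at save time; the expected values noted in the comments are assumptions about examples/deepspeed/ds_z3_offload_config.json, not a quote of the file:

import json

# Inspect the DeepSpeed config referenced by the training arguments above.
with open("examples/deepspeed/ds_z3_offload_config.json") as f:
    cfg = json.load(f)

zero = cfg.get("zero_optimization", {})
print("stage:", zero.get("stage"))                    # expect 3 (assumed)
print("offload_param:", zero.get("offload_param"))    # e.g. {"device": "cpu", ...} (assumed)
print("offload_optimizer:", zero.get("offload_optimizer"))
print("gather 16-bit weights on save:",
      zero.get("stage3_gather_16bit_weights_on_model_save"))
# If the last key is true, DeepSpeed materializes a full unsharded 16-bit
# copy of the model on the main process whenever a checkpoint is written.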
Memory usage when saving the model:
Why does the main process use so much memory here? The container is configured with 720 GB of RAM, yet when the final checkpoint was being saved, memory overflowed and the container restarted.
PS: the dataset has 1000+ samples; the longest is 3000+ tokens.
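One plausible explanation (an assumption based on the config, not confirmed from logs): under ZeRO-3 with stage3_gather_16bit_weights_on_model_save enabled, the main process materializes an unsharded 16-bit copy of the entire 72B base model at checkpoint time, on top of the CPU-offloaded parameter shards already held in host RAM. A back-of-envelope sketch, assuming 8 ranks on one node and bf16 weights:

# Rough host-RAM estimate for the save step; parameter counts are approximate.
PARAMS = 72e9       # Qwen2-72B parameter count (approximate)
BYTES_16BIT = 2     # bytes per bf16/fp16 element

# With offload_param, each rank keeps its shard of the frozen base weights in
# CPU RAM, so the node as a whole holds one full 16-bit copy:
offloaded_gb = PARAMS * BYTES_16BIT / 1e9   # ~144 GB across all ranks

# A full unsharded 16-bit gather on the main process at save time is a
# second complete copy, concentrated on rank 0:
gather_gb = PARAMS * BYTES_16BIT / 1e9      # ~144 GB more, on rank 0 alone

print(f"offloaded shards, node total: ~{offloaded_gb:.0f} GB")
print(f"gather on the main process:   ~{gather_gb:.0f} GB")
# Real peaks run well past this sum: pinned-memory staging buffers, a
# state-dict clone made during serialization, and 16 preprocessing workers
# each holding dataset copies all add on top of the two full model copies.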
Expected behavior
No response
Others
No response
Comments

I have a similar problem: when LoRA fine-tuning Qwen, the dataset is not large, but training keeps failing with a GPU OOM error.

per_device_train_batch_size is already 1, so why does it still OOM? I have 24 GB of GPU memory and am fine-tuning qwen2-7B-instruct. How does the math work out to an OOM?
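On the 24 GB OOM above: even at batch size 1, LoRA SFT of a 7B model can exhaust 24 GB once sequences get long, because the frozen bf16 base weights alone take roughly 14 GB and activation memory grows with sequence length. A rough sketch, assuming Qwen2-7B-like shapes (hidden size 3584, 28 layers, both assumed) and a cutoff_len of 8192 as in the issue's config:

# Rough VRAM budget for single-GPU LoRA SFT of a 7B model; order-of-magnitude
# estimates, not measurements.
PARAMS = 7e9
base_weights_gb = PARAMS * 2 / 1e9            # bf16 base model: ~14 GB, frozen

hidden, layers, seq, bytes_16bit = 3584, 28, 8192, 2   # assumed Qwen2-7B shapes
# Activations scale roughly linearly with sequence length; the x4 factor is a
# crude allowance for intermediate buffers kept alive during backward.
act_gb = layers * seq * hidden * bytes_16bit * 4 / 1e9  # ~7 GB at seq 8192

print(f"base weights:              ~{base_weights_gb:.0f} GB")
print(f"activations @ seq={seq}: ~{act_gb:.0f} GB (order of magnitude)")
# Add LoRA gradients/optimizer states, the CUDA context, and allocator
# fragmentation, and a 24 GB card has little headroom left at long cutoff_len.
# Shorter cutoff_len or gradient checkpointing are the usual levers.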