hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Fine-tuning Qwen2-72B with ds_z3_offload: after training finishes, the run is interrupted during checkpoint saving with the error "Sending process 27214 closing signal SIGTERM". What is the cause? #4782

Closed PhysicianHOYA closed 3 months ago

PhysicianHOYA commented 3 months ago

Reminder

System Info

Reproduction

model_name_or_path: Qwen2-72B

method

stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_alpha: 16
lora_target: all

ddp

ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_offload_config.json

dataset

dataset: 4-123
template: default
cutoff_len: 1024
max_samples: 5000
data_seed: 42
overwrite_cache: true
preprocessing_num_workers: 16

output

output_dir: saves/Qwen2-72B-4-123-offload
logging_steps: 10
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_total_limit: 1

train

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
weight_decay: 0.01
num_train_epochs: 1
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

Expected behavior

The experiment has been stuck here for a long time. I hope you can help resolve the issue above. Many thanks!

Others

No response

hiyouga commented 3 months ago

Exit code -9 means the process ran out of memory.
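For context (a minimal sketch, not from the thread): a worker exiting with code -9 was terminated by SIGKILL (signal 9), which is what the Linux OOM killer sends when host RAM is exhausted. Negative return codes in the launcher's logs are how a signal-terminated child is reported:

```python
import subprocess
import sys

# Spawn a child that kills itself with SIGKILL, mimicking what the Linux
# OOM killer does to a process that exhausts host memory.
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)

# subprocess reports a signal-terminated child as the negated signal number,
# so SIGKILL (9) shows up as return code -9 in the training launcher's logs.
print(child.returncode)
```

On the machine where the crash happened, `dmesg` usually contains a matching "Out of memory: Killed process ..." line confirming the OOM kill.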

mengxz0203 commented 2 months ago

Exit code -9 means the process ran out of memory.

Is there any way to fix this, or is the only option to add more RAM? I have 128 GB of RAM and 4×V100 (32 GB) GPUs, and I hit the same problem fine-tuning gemma2-27b with ds_z3_offload.
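A rough, hedged estimate of why 128 GB can be tight (assumed numbers, not measurements): with ZeRO-3 parameter offload, an fp16 copy of the base weights lives in host RAM, and actual usage is higher still once optimizer states, dataloader workers, and checkpoint-saving buffers are added:

```python
GIB = 1024 ** 3

def offload_ram_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Host RAM for the fp16 base-model copy kept by ZeRO-3 parameter offload."""
    return n_params * bytes_per_param / GIB

# Approximate parameter counts; headline sizes, not exact checkpoints.
qwen2_72b = offload_ram_gib(72e9)   # weights alone, before any overhead
gemma2_27b = offload_ram_gib(27e9)

print(f"Qwen2-72B fp16 weights: {qwen2_72b:.0f} GiB")
print(f"Gemma-2-27B fp16 weights: {gemma2_27b:.0f} GiB")
```

Under these assumptions the Qwen2-72B weights alone (~134 GiB) already exceed 128 GB of RAM, and a 27B model (~50 GiB) leaves less headroom than it appears once saving temporarily materializes additional full-precision copies, which is consistent with the OOM kill happening during checkpointing.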