hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Fine-tuning Qwen2-72B with ds_z3_offload: after training finishes, the run is interrupted during checkpoint saving with the error "Sending process 27214 closing signal SIGTERM". What is the cause? #4782

Closed PhysicianHOYA closed 3 months ago

PhysicianHOYA commented 3 months ago

Reminder

System Info

Reproduction

model_name_or_path: Qwen2-72B

method

stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_alpha: 16
lora_target: all

ddp

ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_offload_config.json

dataset

dataset: 4-123
template: default
cutoff_len: 1024
max_samples: 5000
data_seed: 42
overwrite_cache: true
preprocessing_num_workers: 16

output

output_dir: saves/Qwen2-72B-4-123-offload
logging_steps: 10
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_total_limit: 1

train

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
weight_decay: 0.01
num_train_epochs: 1
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

Expected behavior

The experiment has been stuck here for a long time. I hope you can help resolve the issue above. Many thanks!

Others

No response

hiyouga commented 3 months ago

Exit code -9 means the process ran out of memory.
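For context (a minimal sketch, not from the thread): a worker exiting with code -9 was terminated by SIGKILL (signal 9), which is what the Linux OOM killer sends when host RAM is exhausted. Negative return codes in the launcher's logs are how a signal-terminated child is reported:

```python
import subprocess
import sys

# Spawn a child that kills itself with SIGKILL, mimicking what the Linux
# OOM killer does to a process that exhausts host memory.
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)

# subprocess reports a signal-terminated child as the negated signal number,
# so SIGKILL (9) shows up as return code -9 in the training launcher's logs.
print(child.returncode)
```

On the machine where the crash happened, `dmesg` usually contains a matching "Out of memory: Killed process ..." line confirming the OOM kill.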

mengxz0203 commented 2 months ago

Exit code -9 means the process ran out of memory.

Is there any way to fix this, or is the only option to add more RAM? I have 128 GB of RAM and 4×V100 (32 GB) GPUs, and I hit the same problem fine-tuning gemma2-27b with ds_z3_offload.
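A rough, hedged estimate of why 128 GB can be tight (assumed numbers, not measurements): with ZeRO-3 parameter offload, an fp16 copy of the base weights lives in host RAM, and actual usage is higher still once optimizer states, dataloader workers, and checkpoint-saving buffers are added:

```python
GIB = 1024 ** 3

def offload_ram_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Host RAM for the fp16 base-model copy kept by ZeRO-3 parameter offload."""
    return n_params * bytes_per_param / GIB

# Approximate parameter counts; headline sizes, not exact checkpoints.
qwen2_72b = offload_ram_gib(72e9)   # weights alone, before any overhead
gemma2_27b = offload_ram_gib(27e9)

print(f"Qwen2-72B fp16 weights: {qwen2_72b:.0f} GiB")
print(f"Gemma-2-27B fp16 weights: {gemma2_27b:.0f} GiB")
```

Under these assumptions the Qwen2-72B weights alone (~134 GiB) already exceed 128 GB of RAM, and a 27B model (~50 GiB) leaves less headroom than it appears once saving temporarily materializes additional full-precision copies, which is consistent with the OOM kill happening during checkpointing.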