System Info

llamafactory version: 0.8.3.dev0

Reproduction

```yaml
model_name_or_path: Qwen2-72B

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_alpha: 16
lora_target: all

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_offload_config.json

### dataset
dataset: 4-123
template: default
cutoff_len: 1024
max_samples: 5000
data_seed: 42
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/Qwen2-72B-4-123-offload
logging_steps: 10
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_total_limit: 1

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
weight_decay: 0.01
num_train_epochs: 1
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
```
Expected behavior

The run has been stuck at this point for a long time. I would really appreciate help with the problem above. Thank you very much!
Others

No response
-9 means it ran out of memory.
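For rough context (back-of-the-envelope numbers, assuming the offloaded weights are kept in half precision): ds_z3_offload_config.json makes DeepSpeed ZeRO-3 park the model parameters in host RAM, and 72B parameters × 2 bytes ≈ 144 GB for the weights alone, before DeepSpeed buffers, the dataset cache and the OS are counted. A return code of -9 is SIGKILL, which on Linux usually means the kernel OOM killer terminated the process when host memory was exhausted.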
Is there any way around this, or is the only option to add more RAM? I have 128 GB of RAM and 4×V100 (32 GB) GPUs, and I hit the same problem fine-tuning gemma2-27b with ds_z3_offload.
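One way to avoid holding the full half-precision model in host RAM is to drop the ZeRO-3 offload and load the base model in 4-bit with QLoRA instead, which LLaMA-Factory exposes through `quantization_bit`. Below is a minimal sketch for the gemma2-27b case, not a verified recipe: the model id, dataset name, output dir and template are assumptions/placeholders, and the options should be double-checked against your LLaMA-Factory version.

```yaml
### model
model_name_or_path: google/gemma-2-27b   # assumed HF id; ~13.5 GB of 4-bit weights fits a 32 GB V100
quantization_bit: 4                      # QLoRA: 4-bit base model, so no fp16 copy sits in CPU RAM

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_alpha: 16
lora_target: all

### dataset
dataset: my_dataset                      # placeholder
template: gemma                          # assumed chat template for Gemma-2
cutoff_len: 1024

### output
output_dir: saves/gemma2-27b-qlora       # placeholder

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
num_train_epochs: 1
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true                               # V100 has no bf16 support
```

Note that the `deepspeed:` line is dropped on purpose: as far as I know, 4-bit quantized models generally cannot be sharded with ZeRO-3, so with 4 GPUs this runs as plain data parallelism, one 4-bit copy of the model per card. For Qwen2-72B even the 4-bit weights (~36 GB) exceed a single 32 GB V100, so that model still needs ZeRO-3 sharding and correspondingly more host RAM (or larger GPUs).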