hiyouga / LLaMA-Factory

Unify Efficient Fine-Tuning of 100+ LLMs

QLoRA fine-tuning of Qwen2-57B: with a single A6000 the VRAM usage is 40 GB, but with two A6000s each card also uses 40 GB. What is the reason? #4447

Closed · PhysicianHOYA closed 4 days ago

PhysicianHOYA commented 4 days ago

Reminder

System Info

[2024-06-24 21:08:01,145] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)

Reproduction

FORCE_TORCHRUN=1 llamafactory-cli train my_examples/train.yaml

### model
model_name_or_path: Qwen2-57B
adapter_name_or_path:
quantization_bit: 4
double_quantization: true
quantization_type: nf4

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_alpha: 16
lora_target: all

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z2_config-copy.json

### dataset
dataset: alpaca_zh_demo
template: qwen
cutoff_len: 1024
max_samples: 100000
data_seed: 42
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/Qwen2-57B-lora
logging_steps: 10
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_total_limit: 2
gradient_checkpointing: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 4.0e-5
num_train_epochs: 5
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
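The ddp section points to a local copy of the repository's ZeRO-2 DeepSpeed config. The contents of ds_z2_config-copy.json are not shown in this issue; for reference only, a minimal ZeRO-2 config in the style of LLaMA-Factory's examples/deepspeed/ds_z2_config.json (assumed here, not necessarily identical to the poster's copied file) looks roughly like:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "contiguous_gradients": true
  }
}

As background on this design choice: ZeRO stage 2 partitions only the optimizer states and gradients across data-parallel ranks, while every rank keeps a full replica of the model weights.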

Expected behavior

When fine-tuning the model, a single GPU uses 40 GB of VRAM. I would like this to be improved so that with two GPUs each card uses roughly 20 GB, rather than the current behavior where both cards each use 40 GB.

Others

No response