hiyouga / LLaMA-Factory

Unify Efficient Fine-Tuning of 100+ LLMs

QLoRA fine-tuning of Qwen2-57B: with a single A6000 the VRAM usage is 40 GB, but with two A6000s each card also uses 40 GB. What is the reason? #4447

Closed · PhysicianHOYA closed 4 days ago

PhysicianHOYA commented 4 days ago

Reminder

System Info

[2024-06-24 21:08:01,145] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)

Reproduction

FORCE_TORCHRUN=1 llamafactory-cli train my_examples/train.yaml

### model
model_name_or_path: Qwen2-57B
adapter_name_or_path:
quantization_bit: 4
double_quantization: true
quantization_type: nf4

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_alpha: 16
lora_target: all

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z2_config-copy.json

### dataset
dataset: alpaca_zh_demo
template: qwen
cutoff_len: 1024
max_samples: 100000
data_seed: 42
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/Qwen2-57B-lora
logging_steps: 10
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_total_limit: 2
gradient_checkpointing: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 4.0e-5
num_train_epochs: 5
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
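The ddp section points to a local copy of the repository's ZeRO-2 DeepSpeed config. The contents of ds_z2_config-copy.json are not shown in this issue; for reference only, a minimal ZeRO-2 config in the style of LLaMA-Factory's examples/deepspeed/ds_z2_config.json (assumed here, not necessarily identical to the poster's copied file) looks roughly like:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "contiguous_gradients": true
  }
}

As background on this design choice: ZeRO stage 2 partitions only the optimizer states and gradients across data-parallel ranks, while every rank keeps a full replica of the model weights.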

Expected behavior

When fine-tuning the model, a single GPU uses 40 GB of VRAM. I would like this to be improved so that with two GPUs each card uses roughly 20 GB, rather than the current behavior where both cards each use 40 GB.

Others

No response