hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Very high memory usage for a small model #5077

Open avcode-exe opened 1 month ago

avcode-exe commented 1 month ago

Reminder

System Info

Platform: Kaggle 2xT4

Reproduction

config:

model_name_or_path: openchat/openchat-3.6-8b-20240522
quantization_bit: 4
quantization_method: bitsandbytes
template: openchat-3.6
rope_scaling: dynamic
flash_attn: fa2

stage: rm
do_train: True
finetuning_type: lora
lora_target: all
learning_rate: 5.0e-5
num_train_epochs: 1.0
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
lr_scheduler_type: cosine
max_grad_norm: 1.0
logging_steps: 5
save_steps: 100
warmup_ratio: 0.1
optim: adamw_torch
packing: False
fp16: True
include_num_input_tokens_seen: True
lora_rank: 8
lora_alpha: 16
lora_dropout: 0
use_rslora: True
deepspeed: /kaggle/working/LLaMA-Factory/examples/deepspeed/ds_z2_config.json

dataset: dpo_mix_en
dataset_dir: data
cutoff_len: 4096
max_samples: 100000

output_dir: saves/OpenChat3.6-8B-Chat/lora/train_2024-08-05-18-17-52
plot_loss: True
ddp_timeout: 180000000

Run !llamafactory-cli train <path-to-config-file>
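As a sanity check, the 4-bit load can be measured outside the trainer. A minimal sketch using transformers and bitsandbytes directly (the quantization settings mirror the config above, assuming the default NF4 quant type; the device_map and print are just diagnostics, not part of LLaMA-Factory):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 + fp16 compute, mirroring quantization_bit: 4 and fp16: True above.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "openchat/openchat-3.6-8b-20240522",
    quantization_config=bnb,
    torch_dtype=torch.float16,
    device_map={"": 0},  # load everything on GPU 0 so the number is easy to read
)
print(f"4-bit weights on GPU 0: {torch.cuda.memory_allocated(0) / 2**30:.2f} GiB")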

Expected behavior

I expected memory usage to be low, since the model is loaded in 4-bit and trained with QLoRA plus DeepSpeed ZeRO-2 across two T4 GPUs and the CPU.
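For reference, a back-of-the-envelope per-GPU estimate for QLoRA on an 8B model (my own rough numbers, not taken from the logs):

# Rough per-GPU estimate (assumed numbers, for intuition only).
params = 8.03e9                       # Llama-3-8B-class parameter count
weights_gib = params * 0.5 / 2**30    # NF4 is ~0.5 byte/param -> ~3.7 GiB
lora_params = 21e6                    # rank-8 adapters on all linear layers (rough)
# fp16 adapter weights + fp32 master copy + fp32 AdamW m and v states
adapter_gib = lora_params * (2 + 4 + 4 + 4) / 2**30
print(f"weights ~{weights_gib:.1f} GiB, adapters+optimizer ~{adapter_gib:.2f} GiB")
# The remainder (activations, adapter gradients, CUDA context, fragmentation)
# grows with cutoff_len and is what actually pushes a 14.7 GiB T4 over the edge.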

Others

It took around 12 GB (6 GB per GPU) just to load the model in 4-bit!

Memory usage before the OOM (during training) is shown in the attached screenshot. Error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.47 GiB. GPU 0 has a total capacty of 14.74 GiB of which 202.12 MiB is free. Process 16375 has 14.54 GiB memory in use. Of the allocated memory 11.94 GiB is allocated by PyTorch, and 2.37 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
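The allocator hint at the end of the traceback can be tried; a sketch of setting it (these are standard PyTorch allocator options, but the 128 MiB value is a guess, and it must take effect before CUDA is initialized, e.g. exported before running llamafactory-cli):

import os

# Reduce fragmentation from variable-sized allocations; must run before
# torch initializes CUDA in this process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
# On newer PyTorch builds, "expandable_segments:True" is another option.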

I also tried with FSDP:

!CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml \
    src/train.py /kaggle/working/config.yaml

Config file:

model_name_or_path: openchat/openchat-3.6-8b-20240522
quantization_bit: 4
quantization_method: bitsandbytes
template: openchat-3.6
rope_scaling: dynamic
flash_attn: fa2

stage: rm
do_train: True
finetuning_type: lora
lora_target: all
learning_rate: 1.0e-4
num_train_epochs: 2.0
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
lr_scheduler_type: cosine
max_grad_norm: 1.0
logging_steps: 5
save_steps: 100
warmup_ratio: 0.05
optim: adamw_bnb_8bit
packing: False
fp16: True
include_num_input_tokens_seen: True
lora_rank: 16
lora_alpha: 16
lora_dropout: 0
use_rslora: True
deepspeed: /kaggle/working/LLaMA-Factory/examples/deepspeed/ds_z2_config.json

dataset: dpo_mix_en
dataset_dir: data
cutoff_len: 4096
max_samples: 100000
val_size: 0.01
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 100

output_dir: saves/OpenChat3.6-8B-Chat/lora/train_2024-08-05-18-17-52
plot_loss: True
ddp_timeout: 180000000
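One thing worth checking for the FSDP attempt: FSDP flattens parameters, and the Hugging Face FSDP+QLoRA recipe expects the 4-bit quant storage dtype to match the training dtype. A hedged sketch of the relevant BitsAndBytesConfig fields (I'm assuming fp16 because T4s lack bf16, and I have not verified how LLaMA-Factory sets this internally):

import torch
from transformers import BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_storage=torch.float16,  # should match the dtype FSDP shards in
)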
avcode-exe commented 1 month ago

Just noticed that n_gpu is set to 1 with both FSDP and DeepSpeed (see attached screenshot). Now I'm even more confused...
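If it helps: under accelerate/DeepSpeed each process drives exactly one GPU, so n_gpu = 1 per process is expected; the total GPU count is the world size. A small diagnostic sketch (my own, to run inside a launched process):

import torch
import torch.distributed as dist

print("devices visible to this process:", torch.cuda.device_count())
if dist.is_available() and dist.is_initialized():
    # Each rank handles one GPU; world_size is the real "number of GPUs".
    print("rank", dist.get_rank(), "of", dist.get_world_size())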

neavo commented 1 month ago

Try:

packing: True

This seems to be a bug: if packing is not enabled, training will exhaust however much VRAM is available.

It may also be related to other parameters.
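For intuition, a minimal illustration of what packing changes (my own sketch, not LLaMA-Factory's implementation): packed batches are always exactly cutoff_len tokens, so step-to-step allocations are uniform, while unpacked batches vary with each sample's length:

def pack(token_lists, block_size):
    # Concatenate all tokenized samples, then cut into fixed-size blocks.
    flat = [tok for seq in token_lists for tok in seq]
    return [flat[i:i + block_size]
            for i in range(0, len(flat) - block_size + 1, block_size)]

lengths = [300, 4096, 150, 2200]                   # made-up per-sample token counts
samples = [[0] * n for n in lengths]
print(sorted(set(map(len, samples))))              # [150, 300, 2200, 4096]: varied shapes
print(sorted(set(map(len, pack(samples, 4096)))))  # [4096]: one uniform shape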