hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Very high memory usage for a small model #5077

Open avcode-exe opened 1 month ago

avcode-exe commented 1 month ago

Reminder

System Info

Platform: Kaggle 2xT4

Reproduction

config:

model_name_or_path: openchat/openchat-3.6-8b-20240522
quantization_bit: 4
quantization_method: bitsandbytes
template: openchat-3.6
rope_scaling: dynamic
flash_attn: fa2

stage: rm
do_train: True
finetuning_type: lora
lora_target: all
learning_rate: 5.0e-5
num_train_epochs: 1.0
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
lr_scheduler_type: cosine
max_grad_norm: 1.0
logging_steps: 5
save_steps: 100
warmup_ratio: 0.1
optim: adamw_torch
packing: False
fp16: True
include_num_input_tokens_seen: True
lora_rank: 8
lora_alpha: 16
lora_dropout: 0
use_rslora: True
deepspeed: /kaggle/working/LLaMA-Factory/examples/deepspeed/ds_z2_config.json

dataset: dpo_mix_en
dataset_dir: data
cutoff_len: 4096
max_samples: 100000

output_dir: saves/OpenChat3.6-8B-Chat/lora/train_2024-08-05-18-17-52
plot_loss: True
ddp_timeout: 180000000

Run !llamafactory-cli train <path-to-config-file>
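As a sanity check, the 4-bit load can be measured outside the trainer. A minimal sketch using transformers and bitsandbytes directly (the quantization settings mirror the config above, assuming the default NF4 quant type; the device_map and print are just diagnostics, not part of LLaMA-Factory):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 + fp16 compute, mirroring quantization_bit: 4 and fp16: True above.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "openchat/openchat-3.6-8b-20240522",
    quantization_config=bnb,
    torch_dtype=torch.float16,
    device_map={"": 0},  # load everything on GPU 0 so the number is easy to read
)
print(f"4-bit weights on GPU 0: {torch.cuda.memory_allocated(0) / 2**30:.2f} GiB")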

Expected behavior

I expected memory usage to be low, since the model is loaded in 4-bit and trained with QLoRA plus DeepSpeed ZeRO-2 across two T4 GPUs and the CPU.
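For reference, a back-of-the-envelope per-GPU estimate for QLoRA on an 8B model (my own rough numbers, not taken from the logs):

# Rough per-GPU estimate (assumed numbers, for intuition only).
params = 8.03e9                       # Llama-3-8B-class parameter count
weights_gib = params * 0.5 / 2**30    # NF4 is ~0.5 byte/param -> ~3.7 GiB
lora_params = 21e6                    # rank-8 adapters on all linear layers (rough)
# fp16 adapter weights + fp32 master copy + fp32 AdamW m and v states
adapter_gib = lora_params * (2 + 4 + 4 + 4) / 2**30
print(f"weights ~{weights_gib:.1f} GiB, adapters+optimizer ~{adapter_gib:.2f} GiB")
# The remainder (activations, adapter gradients, CUDA context, fragmentation)
# grows with cutoff_len and is what actually pushes a 14.7 GiB T4 over the edge.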

Others

It took around 12 GB (6 GB per GPU) just to load the model in 4-bit!

Memory usage before the OOM (during training) is shown in the attached screenshot. Error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.47 GiB. GPU 0 has a total capacty of 14.74 GiB of which 202.12 MiB is free. Process 16375 has 14.54 GiB memory in use. Of the allocated memory 11.94 GiB is allocated by PyTorch, and 2.37 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
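The allocator hint at the end of the traceback can be tried; a sketch of setting it (these are standard PyTorch allocator options, but the 128 MiB value is a guess, and it must take effect before CUDA is initialized, e.g. exported before running llamafactory-cli):

import os

# Reduce fragmentation from variable-sized allocations; must run before
# torch initializes CUDA in this process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
# On newer PyTorch builds, "expandable_segments:True" is another option.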

I also tried with FSDP:

!CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml \
    src/train.py /kaggle/working/config.yaml

Config file:

model_name_or_path: openchat/openchat-3.6-8b-20240522
quantization_bit: 4
quantization_method: bitsandbytes
template: openchat-3.6
rope_scaling: dynamic
flash_attn: fa2

stage: rm
do_train: True
finetuning_type: lora
lora_target: all
learning_rate: 1.0e-4
num_train_epochs: 2.0
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
lr_scheduler_type: cosine
max_grad_norm: 1.0
logging_steps: 5
save_steps: 100
warmup_ratio: 0.05
optim: adamw_bnb_8bit
packing: False
fp16: True
include_num_input_tokens_seen: True
lora_rank: 16
lora_alpha: 16
lora_dropout: 0
use_rslora: True
deepspeed: /kaggle/working/LLaMA-Factory/examples/deepspeed/ds_z2_config.json

dataset: dpo_mix_en
dataset_dir: data
cutoff_len: 4096
max_samples: 100000
val_size: 0.01
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 100

output_dir: saves/OpenChat3.6-8B-Chat/lora/train_2024-08-05-18-17-52
plot_loss: True
ddp_timeout: 180000000
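One thing worth checking for the FSDP attempt: FSDP flattens parameters, and the Hugging Face FSDP+QLoRA recipe expects the 4-bit quant storage dtype to match the training dtype. A hedged sketch of the relevant BitsAndBytesConfig fields (I'm assuming fp16 because T4s lack bf16, and I have not verified how LLaMA-Factory sets this internally):

import torch
from transformers import BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_storage=torch.float16,  # should match the dtype FSDP shards in
)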
avcode-exe commented 1 month ago

Just noticed that n_gpu is set to 1 with both FSDP and DeepSpeed (see attached screenshot). Now I'm even more confused...
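If it helps: under accelerate/DeepSpeed each process drives exactly one GPU, so n_gpu = 1 per process is expected; the total GPU count is the world size. A small diagnostic sketch (my own, to run inside a launched process):

import torch
import torch.distributed as dist

print("devices visible to this process:", torch.cuda.device_count())
if dist.is_available() and dist.is_initialized():
    # Each rank handles one GPU; world_size is the real "number of GPUs".
    print("rank", dist.get_rank(), "of", dist.get_world_size())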

neavo commented 1 month ago

Try:

packing: True

This seems to be a bug: if packing is not enabled, training will exhaust however much VRAM is available.

It may also be related to other parameters.
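For intuition, a minimal illustration of what packing changes (my own sketch, not LLaMA-Factory's implementation): packed batches are always exactly cutoff_len tokens, so step-to-step allocations are uniform, while unpacked batches vary with each sample's length:

def pack(token_lists, block_size):
    # Concatenate all tokenized samples, then cut into fixed-size blocks.
    flat = [tok for seq in token_lists for tok in seq]
    return [flat[i:i + block_size]
            for i in range(0, len(flat) - block_size + 1, block_size)]

lengths = [300, 4096, 150, 2200]                   # made-up per-sample token counts
samples = [[0] * n for n in lengths]
print(sorted(set(map(len, samples))))              # [150, 300, 2200, 4096]: varied shapes
print(sorted(set(map(len, pack(samples, 4096)))))  # [4096]: one uniform shape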