hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

LLaMA3 8B LoRA training on 8× A800 GPUs always OOMs after a while #3631

Closed · 807660937 closed this 4 months ago

807660937 commented 4 months ago

Reminder

Reproduction

USE_MODELSCOPE_HUB=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.run \
        --nproc_per_node 8 \
        --nnodes 1 \
        --standalone \
        src/train.py examples/water/0508_wa_llama3_8b_lora_sft.yaml

# examples/water/0508_wa_llama3_8b_lora_sft.yaml
# model
model_name_or_path: LLM-Research/Meta-Llama-3-8B-Instruct

# method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj

# ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json

# dataset
dataset: identity_water,alpaca_gpt4_en,alpaca_gpt4_zh,lima,glaive_toolcall,oaast_sft_zh,ruozhiba,identity_water
template: llama3
cutoff_len: 8192
max_samples: 
val_size: 0.01
overwrite_cache: true
preprocessing_num_workers: 32

# output
output_dir: saves/LLM-Research/Meta-Llama-3-8B-Instruct/lora/sft_wa_0508
logging_steps: 4
save_steps: 200
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 6
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
bf16: true

# eval
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 100
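
For reference, the effective optimizer-step batch implied by this config is simple arithmetic (the per-run step counts in the logs below depend on the datasets, which are not reproduced here):

# Sequences consumed per optimizer step under the config above.
per_device_train_batch_size = 6
gradient_accumulation_steps = 8
num_gpus = 8  # CUDA_VISIBLE_DEVICES=0-7, nproc_per_node=8

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch)  # 384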

The OOM reproduces almost consistently across two runs, and it looks as if GPU memory usage keeps increasing:

 15%|█▍ | 129/882 [40:17<3:41:05, 17.62s/it]Traceback (most recent call last):
  8%|▊ | 55/663 [24:56<5:05:23, 30.14s/it]Traceback (most recent call last):

The OOM reliably appears after training for a while. (screenshot attached)

Expected behavior

No response

System Info

No response

Others

No response

hiyouga commented 4 months ago

Lower the batch size. Because sequence lengths differ from batch to batch, GPU memory usage fluctuates.
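
If you want to check whether the peaks track batch composition rather than a true leak, here is a minimal debugging sketch. It is hypothetical and not part of LLaMA-Factory; it assumes you can register a callback on the underlying transformers Trainer:

import torch
from transformers import TrainerCallback

class PeakMemoryCallback(TrainerCallback):
    """Logs per-step peak CUDA allocation (debugging sketch, not library code)."""

    def on_step_begin(self, args, state, control, **kwargs):
        if torch.cuda.is_available():
            # Reset so each step's peak reflects only that step.
            torch.cuda.reset_peak_memory_stats()

    def on_step_end(self, args, state, control, **kwargs):
        if torch.cuda.is_available():
            peak_gib = torch.cuda.max_memory_allocated() / 2**30
            print(f"step {state.global_step}: peak {peak_gib:.2f} GiB")

If the peaks spike on batches containing long samples but return to baseline afterwards, it is fluctuation rather than a leak, and lowering per_device_train_batch_size (e.g. 6 to 3) while doubling gradient_accumulation_steps keeps the effective batch at 384.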

807660937 commented 4 months ago

OK, I'll try that. So cutoff_len only truncates over-long samples, and no padding is applied, right?

hiyouga commented 4 months ago

Padding would significantly slow down training.
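
A quick illustration of why (using a hypothetical length distribution, not one measured from these datasets): padding every sample up to cutoff_len would spend most of the compute on pad tokens.

import random

random.seed(0)
cutoff_len = 8192
# Hypothetical token lengths mimicking a mixed SFT corpus (assumption).
lengths = [min(int(random.lognormvariate(6.0, 1.0)), cutoff_len) for _ in range(10_000)]

real_tokens = sum(lengths)
padded_tokens = len(lengths) * cutoff_len
print(f"pad-token share if padded to {cutoff_len}: {1 - real_tokens / padded_tokens:.1%}")

With a typical SFT length mix the pad share comes out well above 80%, which is why truncating over-long samples and padding only within each batch wastes far less compute than padding everything to the cutoff.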