hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Long-context OOM: Llama 3 8B + FlashAttention-2 + Unsloth + 4-bit quantization + 100k context still OOMs on an A100-80G #3638

Closed · shumuha closed this issue 5 months ago

shumuha commented 5 months ago

Reminder

Reproduction

Following https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison, I trained a Llama 3 8B model, but the GPU memory usage differs significantly from what that page reports: on a single GPU, cutoff_len=65536 already nearly fills the 80 GB card, and cutoff_len=100k goes OOM. What could be wrong?

Launch command:

python src/train.py \
    --stage sft \
    --do_train \
    --model_name_or_path gradientai/Llama-3-8B-Instruct-Gradient-1048k \
    --dataset summary_train \
    --template llama3 \
    --cutoff_len 102400 \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir output_models/sft \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 200 \
    --learning_rate 1e-5 \
    --num_train_epochs 5.0 \
    --plot_loss \
    --fp16 \
    --flash_attn fa2 \
    --shift_attn \
    --use_unsloth \
    --quantization_bit 4 \
    --overwrite_output_dir

Output:

...
[INFO|modeling_utils.py:1494] 2024-05-08 20:23:20,149 >> Instantiating LlamaForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:928] 2024-05-08 20:23:20,150 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}

Loading checkpoint shards: 100%|██████████| 4/4 [00:10<00:00, 2.70s/it]
[INFO|modeling_utils.py:4170] 2024-05-08 20:23:43,299 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4178] 2024-05-08 20:23:43,300 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /mnt/workspace/public_models/gradientai/Llama-3-8B-Instruct-Gradient-1048k. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:881] 2024-05-08 20:23:43,304 >> loading configuration file /mnt/workspace/public_models/gradientai/Llama-3-8B-Instruct-Gradient-1048k/generation_config.json
[INFO|configuration_utils.py:928] 2024-05-08 20:23:43,304 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128009
  ],
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}

[INFO|tokenization_utils_base.py:2085] 2024-05-08 20:23:53,177 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2085] 2024-05-08 20:23:53,177 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2085] 2024-05-08 20:23:53,177 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2085] 2024-05-08 20:23:53,177 >> loading file tokenizer_config.json
[WARNING|logging.py:314] 2024-05-08 20:23:53,456 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|tokenization_utils_base.py:2085] 2024-05-08 20:23:53,459 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2085] 2024-05-08 20:23:53,459 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2085] 2024-05-08 20:23:53,459 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2085] 2024-05-08 20:23:53,459 >> loading file tokenizer_config.json
[WARNING|logging.py:314] 2024-05-08 20:23:53,706 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:329] 2024-05-08 20:23:53,723 >> /mnt/workspace/public_models/gradientai/Llama-3-8B-Instruct-Gradient-1048k does not have a padding token! Will use pad_token = <|reserved_special_token250|>.
05/08/2024 20:23:54 - INFO - llmtuner.model.utils.checkpointing - Gradient checkpointing enabled.
05/08/2024 20:23:54 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
[WARNING|logging.py:329] 2024-05-08 20:23:54,268 >> Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters are not enabled or a bias term (like in Qwen) is used.
[WARNING|logging.py:329] 2024-05-08 20:23:54,268 >> Unsloth cannot patch Attention layers with our manual autograd engine since either LoRA adapters are not enabled or a bias term (like in Qwen) is used.
[WARNING|logging.py:329] 2024-05-08 20:23:54,268 >> Unsloth cannot patch O projection layer with our manual autograd engine since either LoRA adapters are not enabled or a bias term (like in Qwen) is used.
[WARNING|logging.py:329] 2024-05-08 20:23:54,269 >> Unsloth 2024.4 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
05/08/2024 20:23:54 - INFO - llmtuner.model.loader - trainable params: 3407872 || all params: 8033669120 || trainable%: 0.0424
[INFO|trainer.py:626] 2024-05-08 20:23:54,287 >> Using auto half precision backend
05/08/2024 20:23:54 - WARNING - llmtuner.extras.callbacks - Previous trainer log in this folder will be deleted.
[WARNING|logging.py:329] 2024-05-08 20:23:54,418 >> ==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 2,376 | Num Epochs = 5
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 4
\        /    Total batch size = 4 | Total steps = 2,970
 "-____-"     Number of trainable parameters = 3,407,872
  0%|          | 0/2970 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/workspace/LLM/LLaMA-Factory/src/train.py", line 14, in <module>
    main()
  File "/mnt/workspace/LLM/LLaMA-Factory/src/train.py", line 5, in main
    run_exp()
  File "/mnt/workspace/LLM/LLaMA-Factory/src/llmtuner/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/mnt/workspace/LLM/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 73, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/home/pai/envs/llama/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "<string>", line 361, in _fast_inner_training_loop
  File "/home/pai/envs/llama/lib/python3.10/site-packages/transformers/trainer.py", line 3138, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/pai/envs/llama/lib/python3.10/site-packages/transformers/trainer.py", line 3161, in compute_loss
    outputs = model(**inputs)
  File "/home/pai/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pai/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pai/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
  File "/home/pai/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/pai/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 789, in convert_to_fp32
    return recursively_apply(_convert_to_fp32, tensor, test_type=_is_fp16_bf16_tensor)
  File "/home/pai/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 118, in recursively_apply
    {
  File "/home/pai/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 119, in <dictcomp>
    k: recursively_apply(
  File "/home/pai/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 126, in recursively_apply
    return func(data, *args, **kwargs)
  File "/home/pai/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 781, in _convert_to_fp32
    return tensor.float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 48.93 GiB. GPU 0 ...
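For what it's worth, the size of the failing allocation is consistent with accelerate's convert_to_fp32 upcasting the full logits tensor (this is an interpretation of the traceback, not something the log states explicitly): 1 (batch) × 102,400 (cutoff_len) × 128,256 (Llama 3 vocabulary) × 4 bytes ≈ 52.5 GB ≈ 48.93 GiB, exactly the amount reported. Even the half-precision logits alone would be about 24.5 GiB before the fp32 copy is attempted.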

Expected behavior

Training with a 100k context length should run normally on an 80 GB A100.

System Info

Others

No response

hiyouga commented 5 months ago

Remove shift_attn and set lora_target to all.
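For reference, a sketch of the reported launch command with this advice applied (only --shift_attn dropped and --lora_target changed to all; every other argument kept from the original report, so values such as the dataset and learning rate are unchanged assumptions):

python src/train.py \
    --stage sft \
    --do_train \
    --model_name_or_path gradientai/Llama-3-8B-Instruct-Gradient-1048k \
    --dataset summary_train \
    --template llama3 \
    --cutoff_len 102400 \
    --finetuning_type lora \
    --lora_target all \
    --output_dir output_models/sft \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 200 \
    --learning_rate 1e-5 \
    --num_train_epochs 5.0 \
    --plot_loss \
    --fp16 \
    --flash_attn fa2 \
    --use_unsloth \
    --quantization_bit 4 \
    --overwrite_output_dir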