hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Out of memory when fine-tuning Baichuan2-13B-Chat on two RTX 4090s #1970

Closed leoterry-ulrica closed 10 months ago

leoterry-ulrica commented 10 months ago

Reproduction

accelerate launch src/train_bash.py \
    --stage sft \
    --do_train True \
    --model_name_or_path /root/autodl-tmp/Baichuan2-13B-Chat \
    --finetuning_type lora \
    --template baichuan2 \
    --dataset_dir data \
    --dataset self_cognition \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 30.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --neftune_noise_alpha 0 \
    --lora_rank 8 \
    --lora_dropout 0.1 \
    --lora_target W_pack \
    --output_dir saves/Baichuan2-13B-Chat/lora/train_2023-12-22-17-54-04 \
    --bf16 True \
    --plot_loss True \
    --per_device_train_batch_size 1

Expected behavior

Each RTX 4090 has 24 GB of VRAM, but I still get an out-of-memory error:

12/23/2023 00:18:21 - INFO - llmtuner.data.loader - Loading dataset self_cognition.json...
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
[WARNING|modeling_utils.py:2045] 2023-12-23 00:18:22,961 >> You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
Loading checkpoint shards:   0%|                                                                                                                   | 0/3 [00:00<?, ?it/s]You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:14<00:00,  4.77s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:14<00:00,  4.83s/it]
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
12/23/2023 00:18:37 - INFO - llmtuner.model.utils - Gradient checkpointing enabled.
12/23/2023 00:18:37 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
12/23/2023 00:18:37 - INFO - llmtuner.model.loader - trainable params: 6553600 || all params: 13903221760 || trainable%: 0.0471
[WARNING|modeling_utils.py:2045] 2023-12-23 00:18:37,766 >> You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
12/23/2023 00:18:37 - INFO - llmtuner.model.utils - Gradient checkpointing enabled.
12/23/2023 00:18:37 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
12/23/2023 00:18:37 - INFO - llmtuner.model.loader - trainable params: 6553600 || all params: 13903221760 || trainable%: 0.0471
input_ids:
[195, 16829, 196, 28850, 65, 6461, 4014, 19438, 92574, 65, 1558, 92746, 4014, 92343, 37093, 3000, 92574, 92311, 37166, 12275, 92311, 18183, 65, 52160, 4152, 93082, 66, 92676, 19516, 92402, 11541, 92549, 29949, 68, 2]
inputs:
 <reserved_106>你好<reserved_107>您好,我是 <NAME>,一个由 <AUTHOR> 开发的 AI 助手,很高兴认识您。请问我能为您做些什么?</s>
label_ids:
[-100, -100, -100, 28850, 65, 6461, 4014, 19438, 92574, 65, 1558, 92746, 4014, 92343, 37093, 3000, 92574, 92311, 37166, 12275, 92311, 18183, 65, 52160, 4152, 93082, 66, 92676, 19516, 92402, 11541, 92549, 29949, 68, 2]
labels:
 您好,我是 <NAME>,一个由 <AUTHOR> 开发的 AI 助手,很高兴认识您。请问我能为您做些什么?</s>
Traceback (most recent call last):
  File "/root/autodl-tmp/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/root/autodl-tmp/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/root/autodl-tmp/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/root/autodl-tmp/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 53, in run_sft
    trainer = CustomSeq2SeqTrainer(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 56, in __init__
    super().__init__(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 456, in __init__
    self._move_model_to_device(model, args.device)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 690, in _move_model_to_device
    model = model.to(device)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 23.65 GiB total capacity; 23.12 GiB already allocated; 8.06 MiB free; 23.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/root/autodl-tmp/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/root/autodl-tmp/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/root/autodl-tmp/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/root/autodl-tmp/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 53, in run_sft
    trainer = CustomSeq2SeqTrainer(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 56, in __init__
    super().__init__(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 456, in __init__
    self._move_model_to_device(model, args.device)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 690, in _move_model_to_device
    model = model.to(device)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 1; 23.65 GiB total capacity; 23.12 GiB already allocated; 8.06 MiB free; 23.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 841) of binary: /root/miniconda3/envs/llama_factory/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/llama_factory/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

System Info


- `transformers` version: 4.36.2
- Platform: Linux-5.15.0-83-generic-x86_64-with-glibc2.31
- Python version: 3.10.13
- Huggingface_hub version: 0.20.1
- Safetensors version: 0.4.1
- Accelerate version: 0.25.0
- Accelerate config:    not found
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Others

No response

hiyouga commented 10 months ago

accelerate does not reduce the memory footprint on each GPU. Use DeepSpeed ZeRO-3 to partition the model across multiple GPUs: https://github.com/xverse-ai/XVERSE-65B?tab=readme-ov-file#%E6%A8%A1%E5%9E%8B%E5%BE%AE%E8%B0%83
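
In practice this means launching with deepspeed instead of accelerate and pointing it at a ZeRO-3 config file. A minimal sketch, assuming the config is saved as ds_config.json and reusing the paths and flags from the reproduction command above (the remaining dataset and LoRA flags carry over unchanged):

deepspeed --num_gpus 2 src/train_bash.py \
    --deepspeed ds_config.json \
    --stage sft \
    --model_name_or_path /root/autodl-tmp/Baichuan2-13B-Chat \
    --finetuning_type lora \
    --template baichuan2 \
    --per_device_train_batch_size 1 \
    --output_dir saves/Baichuan2-13B-Chat/lora/train_2023-12-22-17-54-04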

leoterry-ulrica commented 10 months ago

accelerate does not reduce the memory footprint on each GPU. Use DeepSpeed ZeRO-3 to partition the model across multiple GPUs: https://github.com/xverse-ai/XVERSE-65B?tab=readme-ov-file#%E6%A8%A1%E5%9E%8B%E5%BE%AE%E8%B0%83

It still runs out of memory.

Training script:

deepspeed --num_gpus 2 src/train_bash.py \
    --deepspeed deep_speed.json \
    --stage sft \
    --do_train True \
    --model_name_or_path /root/autodl-tmp/Baichuan2-13B-Chat \
    --finetuning_type lora \
    --template baichuan2 \
    --dataset_dir data \
    --dataset self_cognition \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 3 \
    --max_samples 100000 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --neftune_noise_alpha 0 \
    --lora_rank 8 \
    --lora_dropout 0.1 \
    --lora_target W_pack \
    --output_dir saves/Baichuan2-13B-Chat/lora/train_2023-12-22-17-54-04 \
    --bf16 True \
    --plot_loss True

Contents of deep_speed.json:

{
    "train_micro_batch_size_per_gpu":"auto",
    "gradient_accumulation_steps":"auto",
    "gradient_clipping":"auto",
    "zero_allow_untested_optimizer":true,
    "fp16":{
        "enabled":false
    },
    "bfloat16":{
        "enabled":true
    },
    "zero_optimization":{
        "stage":3,
        "allgather_partitions":true,
        "reduce_scatter":true,
        "overlap_comm":false,
        "contiguous_gradients":true
    }
}

Error:

Parameter Offload: Total persistent parameters: 2053120 in 121 params
  0%|                                                                                                                                              | 0/6 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/root/autodl-tmp/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/root/autodl-tmp/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/root/autodl-tmp/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/root/autodl-tmp/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 71, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2758, in compute_loss
    outputs = model(**inputs)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1833, in forward
    loss = self.module(*inputs, **kwargs)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/peft/peft_model.py", line 1073, in forward
    return self.base_model(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 103, in forward
    return self.model.forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Chat/modeling_baichuan.py", line 705, in forward
    logits = self.lm_head(hidden_states)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Chat/modeling_baichuan.py", line 513, in forward
    norm_weight = nn.functional.normalize(self.weight)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/functional.py", line 4661, in normalize
    return input / denom
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.20 GiB (GPU 1; 23.65 GiB total capacity; 17.21 GiB already allocated; 678.06 MiB free; 22.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-12-23 22:52:24,054] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 5303
[2023-12-23 22:52:24,471] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 5304
leoterry-ulrica commented 10 months ago

@hiyouga Is there anything wrong with the deepspeed command? It still OOMs.

hiyouga commented 10 months ago

Reduce the batch size.
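
For example, a sketch using only flags already present in the command above; lowering the per-device batch while raising gradient accumulation keeps the effective batch size at 32 (1 x 16 x 2 GPUs, versus the original 4 x 4 x 2):

    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16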

leoterry-ulrica commented 10 months ago

Reduce the batch size.

Solved by adding the following to deep_speed.json:

"offload_param": {
            "device": "cpu",
            "pin_memory": true
        }
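
For context, a sketch of where this fragment sits in the full config; offload_param nests inside zero_optimization and is only honored with ZeRO stage 3 (the rest of deep_speed.json above stays unchanged):

    "zero_optimization": {
        "stage": 3,
        "allgather_partitions": true,
        "reduce_scatter": true,
        "overlap_comm": false,
        "contiguous_gradients": true,
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        }
    }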
leoterry-ulrica commented 10 months ago

Also, does train_web.py support loading the model across multiple GPUs? A single GPU cannot fit the 13B model. @hiyouga

hiyouga commented 10 months ago

@leoterry-ulrica web_demo supports multi-GPU loading.

leoterry-ulrica commented 10 months ago

@leoterry-ulrica web_demo supports multi-GPU loading.

Which parameters should I add to the launch command so that the 13B model is loaded across multiple GPUs and a single GPU does not OOM? The current launch command is below; does it need to be changed?

python src/web_demo.py \
    --model_name_or_path /root/autodl-tmp/Baichuan2-13B-Chat \
    --adapter_name_or_path saves/Baichuan2-13B-Chat/lora/train_2023-12-24-09-54-04 \
    --template baichuan2 \
    --finetuning_type lora
luoqishuai commented 8 months ago

@leoterry-ulrica web_demo supports multi-GPU loading.

Which parameters should I add to the launch command so that the 13B model is loaded across multiple GPUs and a single GPU does not OOM? The current launch command is below; does it need to be changed?

python src/web_demo.py \
    --model_name_or_path /root/autodl-tmp/Baichuan2-13B-Chat \
    --adapter_name_or_path saves/Baichuan2-13B-Chat/lora/train_2023-12-24-09-54-04 \
    --template baichuan2 \
    --finetuning_type lora

@hiyouga How do I set up multiple GPUs for web_demo? If I just set CUDA_VISIBLE_DEVICES=4,5,6,7, loading Qwen1.5-72B-Chat on 4x 80 GB GPUs still OOMs; judging from GPU utilization, only GPU 4 is used. I could not find a matching example in the issues. With vLLM, 2x 80 GB is enough to load the same model at a context length of 4096. The code version is roughly 2024.03.04. Launch script:

PYTHONPATH=***/LLaMA-Factory CUDA_VISIBLE_DEVICES=4,5,6,7 WEB_PORT=6072 nohup python3 ./src/web_demo.py \
    --model_name_or_path ***/pretrain_model/Qwen1.5-72B-Chat  \
    --template qwen \
    --finetuning_type full \
    --repetition_penalty 1.2 \
    --cutoff_len 8192 > qwen1.5-72B-web.out &
hiyouga commented 8 months ago

@luoqishuai Please update to the latest version of the code.

luoqishuai commented 8 months ago

The problem is fixed in the new version of the code. Thanks a lot!