hiyouga / ChatGLM-Efficient-Tuning

Fine-tuning ChatGLM-6B with PEFT | 基于 PEFT 的高效 ChatGLM 微调

Single-machine multi-GPU LoRA training reports CUDA out of memory #273

Closed. QuantumDriver closed this issue 1 year ago.

QuantumDriver commented 1 year ago

Environment:

In [1]: import torch

In [2]: import accelerate

In [3]: print(torch.__version__)
2.0.1+cu117

In [4]: print(accelerate.__version__)
0.19.0

Command:

accelerate launch src/train_sft.py \
    --do_train \
    --use_v2 True \
    --ddp_find_unused_parameters False \
    --dataset adgen_train \
    --finetuning_type lora \
    --output_dir adgen_lora \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 2000 \
    --learning_rate 1e-3 \
    --num_train_epochs 2.0 \
    --lora_rank 4 \
    --ddp_find_unused_parameters False \
    --source_prefix 你现在是一名销售员,根据以下商品标签生成一段有吸引力的商品广告词。 \
    --plot_loss \
    --fp16

Error message:

Traceback (most recent call last):
  File "/data/ouxin/llm/code/chatglm_et/src/train_sft.py", line 114, in <module>
    main()
  File "/data/ouxin/llm/code/chatglm_et/src/train_sft.py", line 58, in main
    trainer = Seq2SeqTrainerForChatGLM(
  File "/data/ouxin/llm/code/chatglm_et/src/utils/peft_trainer.py", line 80, in __init__
    super().__init__(**kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 56, in __init__
    super().__init__(
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 499, in __init__
    self._move_model_to_device(model, args.device)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 741, in _move_model_to_device
    model = model.to(device)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 5 more times]
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 214.00 MiB (GPU 0; 10.76 GiB total capacity; 10.09 GiB already allocated; 77.56 MiB free; 10.09 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
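
(The allocator hint at the end of that message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before CUDA is initialized; a minimal sketch is below, with 128 as an arbitrary example value. Note, though, that here almost the whole 10.76 GiB card is taken up by a single model copy, so fragmentation tuning alone is unlikely to be enough.)

import os

# Must be set before the first CUDA allocation, ideally before importing torch.
# The value 128 is only an illustration, not a tuned recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # the CUDA caching allocator reads the variable when it is first initialized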

GPU status:

Mon Jul 10 19:57:42 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:5A:00.0 Off |                  N/A |
| 27%   26C    P8    15W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:5E:00.0 Off |                  N/A |
| 27%   24C    P8    22W / 250W |  10210MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:62:00.0 Off |                  N/A |
| 27%   25C    P8    21W / 250W |    292MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:66:00.0 Off |                  N/A |
| 27%   25C    P8    20W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  On   | 00000000:B5:00.0 Off |                  N/A |
| 27%   23C    P8     9W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  On   | 00000000:B9:00.0 Off |                  N/A |
| 27%   24C    P8    22W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  On   | 00000000:BD:00.0 Off |                  N/A |
| 27%   24C    P8     1W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  On   | 00000000:C1:00.0 Off |                  N/A |
| 27%   25C    P8     1W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Contents of default_config.yaml under ~/.cache/huggingface/accelerate:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1,2,3,4,5,6,7
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

While running the code I monitored the GPUs in real time. Before main() reached trainer = Seq2SeqTrainerForChatGLM(..., GPUs 1-7 each had around 500 MB in use, while GPU 0 sat at 0 MB, which is odd, since accelerate was configured to use all 8 cards together... Once execution hit the trainer = ... line, GPU 0's memory usage suddenly shot up from 0 while GPUs 1-7 did not change at all, and then GPU 0 ran out of memory...
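
(For reference, this kind of real-time per-GPU monitoring can also be scripted; a minimal sketch using the pynvml package, which is an extra dependency and not part of this repo:)

import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(pynvml.nvmlDeviceGetCount())]
try:
    while True:
        # memory used per GPU in MiB, across all processes, i.e. the same numbers nvidia-smi reports
        used = [pynvml.nvmlDeviceGetMemoryInfo(h).used // 2**20 for h in handles]
        print(" | ".join(f"GPU{i}: {m} MiB" for i, m in enumerate(used)))
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()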

Could someone please take a look at what the problem is? (╥╯^╰╥) Thanks in advance!

hiyouga commented 1 year ago

Multi-GPU training does not reduce the memory used on each individual card. Running fp16 LoRA on 12 GB is a bit tight; try enabling quantization.
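
(A rough back-of-the-envelope check of that point, assuming ChatGLM2-6B has roughly 6.2B parameters and counting weights only, before activations, gradients and optimizer state:)

params = 6.2e9  # approximate parameter count of ChatGLM2-6B (assumption)
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{name}: ~{gib:.1f} GiB for the weights alone")
# fp16 ~11.5 GiB, int8 ~5.8 GiB, int4 ~2.9 GiB: the fp16 copy by itself already
# exceeds the ~10.8 GiB of an 11 GB card, and DDP keeps a full copy on every rank.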

QuantumDriver commented 1 year ago

Multi-GPU training does not reduce the memory used on each individual card. Running fp16 LoRA on 12 GB is a bit tight; try enabling quantization.

Hi, after updating the relevant libraries I ran the 4-bit quantized command (removed --fp16 and added --quantization_bit 4), as follows:

accelerate launch src/train_sft.py \
    --do_train \
    --use_v2 True \
    --ddp_find_unused_parameters False \
    --dataset adgen_train \
    --finetuning_type lora \
    --output_dir adgen_lora \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 2000 \
    --learning_rate 1e-3 \
    --num_train_epochs 2.0 \
    --lora_rank 4 \
    --ddp_find_unused_parameters False \
    --source_prefix 你现在是一名销售员,根据以下商品标签生成一段有吸引力的商品广告词。 \
    --plot_loss \
    --quantization_bit 4 \
    --output_dir lora_test

This time there was no error message; the program printed a few INFO lines and then simply exited. The relevant output is below (only the last part is shown):

Loading checkpoint shards: 100%|████████████████████████████████| 7/7 [00:27<00:00,  3.99s/it]07/11/2023 10:55:29 - INFO - utils.common - Fine-tuning method: LoRA
Loading checkpoint shards:  71%|██████████████████████▊         | 5/7 [00:21<00:07,  3.99s/it]07/11/2023 10:55:33 - INFO - utils.common - Quantized model to 4 bit.
trainable params: 974848 || all params: 3389286400 || trainable%: 0.0288
07/11/2023 10:55:33 - INFO - utils.common - Quantized model to 4 bit.
trainable params: 974848 || all params: 3389286400 || trainable%: 0.0288
07/11/2023 10:55:34 - INFO - utils.common - Quantized model to 4 bit.
trainable params: 974848 || all params: 3389286400 || trainable%: 0.0288
07/11/2023 10:55:34 - INFO - utils.common - Quantized model to 4 bit.
trainable params: 974848 || all params: 3389286400 || trainable%: 0.0288
07/11/2023 10:55:34 - INFO - utils.common - Quantized model to 4 bit.
trainable params: 974848 || all params: 3389286400 || trainable%: 0.0288
07/11/2023 10:55:34 - INFO - utils.common - Quantized model to 4 bit.
trainable params: 974848 || all params: 3389286400 || trainable%: 0.0288
Loading checkpoint shards: 100%|████████████████████████████████| 7/7 [00:26<00:00,  3.82s/it][INFO|modeling_utils.py:3295] 2023-07-11 10:55:37,862 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.

[INFO|modeling_utils.py:3303] 2023-07-11 10:55:37,862 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at THUDM/chatglm2-6b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training.
[INFO|modeling_utils.py:2927] 2023-07-11 10:55:38,111 >> Generation config file not found, using a generation config created from the model config.
07/11/2023 10:55:38 - INFO - utils.common - Fine-tuning method: LoRA
07/11/2023 10:55:43 - INFO - utils.common - Quantized model to 4 bit.
trainable params: 974848 || all params: 3389286400 || trainable%: 0.0288
Running tokenizer on dataset:   0%|                         | 0/114599 [00:00<?, ? examples/s]07/11/2023 10:55:43 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-98dad7a6dee9dd08/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-948d04d6fabae335.arrow
stage: sft                                                                                    
[INFO|trainer.py:399] 2023-07-11 10:56:54,440 >> You have loaded a model on multiple GPUs. `is_model_parallel` attribute will be force-set to `True` to avoid any unexpected behavior such as device placement mismatching.
[INFO|trainer.py:407] 2023-07-11 10:56:54,440 >> The model is loaded in 8-bit precision. To train this model you need to add additional modules inside the model such as adapters using `peft` library and freeze the model weights. Please check  the examples in https://github.com/huggingface/peft for more details.
stage: sft                                                                                    
stage: sft                                                                                    
stage: sft                                                                                    
stage: sft                                                                                    
stage: sft                                                                                    
stage: sft 

I opened the lora_test directory to check, and it is empty. o(╥﹏╥)o

To explain: the stage: sft lines come from a modification I made in common.py, as follows:

        if stage == "sft":
            # print_sft_dataset_example(dataset[0])
            print("stage: sft")
        elif stage == "rm":
            print_pairwise_dataset_example(dataset[0])
        elif stage == "ppo":
            print_ppo_dataset_example(dataset[0])

In addition, I also changed the accelerate config: gpu_ids was changed from 0,1,2,3,4,5,6,7 to 0,2,3,4,5,6,7 (a service keeps occupying GPU 1's memory, about 10 GB of its 11 GB).
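
(One thing worth double-checking in that edited config, as a general accelerate note rather than something confirmed in this thread: when gpu_ids is trimmed to 7 devices, num_processes normally has to be reduced to match, otherwise 8 workers are still launched for only 7 visible GPUs:)

gpu_ids: 0,2,3,4,5,6,7
num_processes: 7   # one DDP process per listed GPU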

Any pointers would be much appreciated!

Yang-HangWA commented 1 year ago

@QuantumDriver Update your copy of the code; I just tried it with the command below:

accelerate launch src/train_bash.py \
 --model_name_or_path /path/to/chatglm2-6b \
 --stage sft \
 --use_v2 \
 --do_train \
 --dataset adgen_train \
 --finetuning_type lora \
 --output_dir adgen_lora \
 --overwrite_cache \
 --per_device_train_batch_size 2 \
 --gradient_accumulation_steps 2 \
 --lr_scheduler_type cosine \
 --logging_steps 10 \
 --save_steps 2000 \
 --learning_rate 1e-3 \
 --num_train_epochs 2.0 \
 --lora_rank 32 \
 --ddp_find_unused_parameters False \
 --source_prefix 你现在是一名销售员,根据以下商品标签生成一段有吸引力的商品广告词。 \
 --quantization_bit 8 \
 --plot_loss

While running, each GPU uses a little over 8 GB. After it finishes, there is a trainer_log.jsonl file in the adgen_lora directory; a single line looks like this:

{"current_steps": 10, "total_steps": 57300, "loss": 4.7845, "reward": null, "learning_rate": 0.000999999924849738, "epoch": 0.0, "percentage": 0.02, "elapsed_time": "0:00:17", "remaining_time": "1 day, 4:01:49"}
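
(In case it is useful, trainer_log.jsonl is plain JSON-lines, so the loss curve can be inspected with a few lines of Python; a sketch, assuming the --output_dir above:)

import json

# each line of the log is one JSON record like the example shown above
with open("adgen_lora/trainer_log.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

for r in records:
    print(r["current_steps"], r["loss"])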

wu-xiaohua commented 1 year ago

Multi-GPU training does not reduce the memory used on each individual card. Running fp16 LoRA on 12 GB is a bit tight; try enabling quantization.

So does that mean multiple GPUs only help with speed?

liuyijiang1994 commented 1 year ago

@QuantumDriver Update your copy of the code; I just tried it with the command below:

accelerate launch src/train_bash.py \
 --model_name_or_path /path/to/chatglm2-6b \
 --stage sft \
 --use_v2 \
 --do_train \
 --dataset adgen_train \
 --finetuning_type lora \
 --output_dir adgen_lora \
 --overwrite_cache \
 --per_device_train_batch_size 2 \
 --gradient_accumulation_steps 2 \
 --lr_scheduler_type cosine \
 --logging_steps 10 \
 --save_steps 2000 \
 --learning_rate 1e-3 \
 --num_train_epochs 2.0 \
 --lora_rank 32 \
 --ddp_find_unused_parameters False \
 --source_prefix 你现在是一名销售员,根据以下商品标签生成一段有吸引力的商品广告词。 \
 --quantization_bit 8 \
 --plot_loss

While running, each GPU uses a little over 8 GB. After it finishes, there is a trainer_log.jsonl file in the adgen_lora directory; a single line looks like this:

{"current_steps": 10, "total_steps": 57300, "loss": 4.7845, "reward": null, "learning_rate": 0.000999999924849738, "epoch": 0.0, "percentage": 0.02, "elapsed_time": "0:00:17", "remaining_time": "1 day, 4:01:49"}

Hello, may I ask which versions of the libraries and CUDA you are using?