Multiple GPUs will not reduce the memory used on each individual card; 12 GB is barely enough for fp16 LoRA, so try enabling quantization.
Hello, after updating the relevant libraries I ran the 4-bit quantized version (removed --fp16 and added --quantization_bit 4), as follows:
accelerate launch src/train_sft.py \
--do_train \
--use_v2 True \
--ddp_find_unused_parameters False \
--dataset adgen_train \
--finetuning_type lora \
--output_dir adgen_lora \
--overwrite_cache \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 2000 \
--learning_rate 1e-3 \
--num_train_epochs 2.0 \
--lora_rank 4 \
--ddp_find_unused_parameters False \
--source_prefix 你现在是一名销售员,根据以下商品标签生成一段有吸引力的商品广告词。 \
--plot_loss \
--quantization_bit 4 \
--output_dir lora_test
This run showed no error messages; it printed a few INFO lines and then the program simply exited. The relevant output is below (only the last part is shown):
Loading checkpoint shards: 100%|████████████████████████████████| 7/7 [00:27<00:00, 3.99s/it]07/11/2023 10:55:29 - INFO - utils.common - Fine-tuning method: LoRA
Loading checkpoint shards: 71%|██████████████████████▊ | 5/7 [00:21<00:07, 3.99s/it]07/11/2023 10:55:33 - INFO - utils.common - Quantized model to 4 bit.
trainable params: 974848 || all params: 3389286400 || trainable%: 0.0288
07/11/2023 10:55:33 - INFO - utils.common - Quantized model to 4 bit.
trainable params: 974848 || all params: 3389286400 || trainable%: 0.0288
07/11/2023 10:55:34 - INFO - utils.common - Quantized model to 4 bit.
trainable params: 974848 || all params: 3389286400 || trainable%: 0.0288
07/11/2023 10:55:34 - INFO - utils.common - Quantized model to 4 bit.
trainable params: 974848 || all params: 3389286400 || trainable%: 0.0288
07/11/2023 10:55:34 - INFO - utils.common - Quantized model to 4 bit.
trainable params: 974848 || all params: 3389286400 || trainable%: 0.0288
07/11/2023 10:55:34 - INFO - utils.common - Quantized model to 4 bit.
trainable params: 974848 || all params: 3389286400 || trainable%: 0.0288
Loading checkpoint shards: 100%|████████████████████████████████| 7/7 [00:26<00:00, 3.82s/it][INFO|modeling_utils.py:3295] 2023-07-11 10:55:37,862 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.
[INFO|modeling_utils.py:3303] 2023-07-11 10:55:37,862 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at THUDM/chatglm2-6b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training.
[INFO|modeling_utils.py:2927] 2023-07-11 10:55:38,111 >> Generation config file not found, using a generation config created from the model config.
07/11/2023 10:55:38 - INFO - utils.common - Fine-tuning method: LoRA
07/11/2023 10:55:43 - INFO - utils.common - Quantized model to 4 bit.
trainable params: 974848 || all params: 3389286400 || trainable%: 0.0288
Running tokenizer on dataset: 0%| | 0/114599 [00:00<?, ? examples/s]07/11/2023 10:55:43 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-98dad7a6dee9dd08/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-948d04d6fabae335.arrow
stage: sft
[INFO|trainer.py:399] 2023-07-11 10:56:54,440 >> You have loaded a model on multiple GPUs. `is_model_parallel` attribute will be force-set to `True` to avoid any unexpected behavior such as device placement mismatching.
[INFO|trainer.py:407] 2023-07-11 10:56:54,440 >> The model is loaded in 8-bit precision. To train this model you need to add additional modules inside the model such as adapters using `peft` library and freeze the model weights. Please check the examples in https://github.com/huggingface/peft for more details.
stage: sft
stage: sft
stage: sft
stage: sft
stage: sft
stage: sft
When I open the lora_test directory to check, it is empty o(╥﹏╥)o
To explain: the stage: sft lines come from a change I made to the code in common.py, as follows:
if stage == "sft":
    # print_sft_dataset_example(dataset[0])
    print("stage: sft")
elif stage == "rm":
    print_pairwise_dataset_example(dataset[0])
elif stage == "ppo":
    print_ppo_dataset_example(dataset[0])
In addition, I also modified the accelerate config: the machine IDs (GPU IDs) were changed from the original 0,1,2,3,4,5,6,7 to 0,2,3,4,5,6,7 (a service keeps occupying memory on GPU 1, roughly 10 GB out of 11 GB).
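For reference, the effect of skipping GPU 1 can also be achieved without editing the accelerate config, by hiding that card at launch time. A minimal sketch, assuming the seven remaining cards and otherwise the same arguments as the command above:
# same training flags as in the command above, with GPU 1 hidden
# and accelerate told to spawn seven processes instead of eight
CUDA_VISIBLE_DEVICES=0,2,3,4,5,6,7 accelerate launch --num_processes 7 src/train_sft.py \
--do_train \
--dataset adgen_train \
--finetuning_type lora \
--quantization_bit 4 \
--output_dir lora_test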
Hoping someone can point me in the right direction~
@QuantumDriver Update the code; I just tried it with the command below.
accelerate launch src/train_bash.py \
--model_name_or_path /path/to/chatglm2-6b \
--stage sft \
--use_v2 \
--do_train \
--dataset adgen_train \
--finetuning_type lora \
--output_dir adgen_lora \
--overwrite_cache \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 2000 \
--learning_rate 1e-3 \
--num_train_epochs 2.0 \
--lora_rank 32 \
--ddp_find_unused_parameters False \
--source_prefix 你现在是一名销售员,根据以下商品标签生成一段有吸引力的商品广告词。 \
--quantization_bit 8 \
--plot_loss
Once it is running, a single card uses a little over 8 GB. After the run finishes, the adgen_lora directory contains a trainer_log.jsonl file; one of its lines reads:
{"current_steps": 10, "total_steps": 57300, "loss": 4.7845, "reward": null, "learning_rate": 0.000999999924849738, "epoch": 0.0, "percentage": 0.02, "elapsed_time": "0:00:17", "remaining_time": "1 day, 4:01:49"}
Multiple GPUs will not reduce the memory used on each individual card; 12 GB is barely enough for fp16 LoRA, so try enabling quantization.
So does that mean multiple GPUs only make training faster?
Hello, could I ask which versions of the libraries and of CUDA you are using?
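One way to collect those numbers, as a minimal sketch (which packages are worth listing here is an assumption):
pip list | grep -Ei "transformers|peft|accelerate|bitsandbytes|datasets|torch"
python -c "import torch; print(torch.__version__, torch.version.cuda)"   # PyTorch and its CUDA build
nvidia-smi | head -n 4   # driver version and the CUDA version it supports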
Environment:
Command used:
Error message:
GPU status:
The configuration of default_config.yaml under the ~/.cache/huggingface/accelerate directory:
While the code was running I monitored the GPUs in real time. Before main() reached trainer = Seq2SeqTrainerForChatGLM(..., GPUs 1-7 each held about 500 MB (GPU 0's usage was 0, which is odd, since all eight cards were configured together in accelerate...). The moment execution hit trainer = ..., GPU 0's memory suddenly shot up from 0 while GPUs 1-7 did not change at all, and then GPU 0 ran out of memory... Could you please take a look at what the problem might be? (╥╯^╰╥) Thanks in advance!
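For real-time monitoring like the one described above, something along these lines is usually enough; a minimal sketch (the one-second interval and the log file name are arbitrary choices):
watch -n 1 nvidia-smi
# or record per-card memory once per second to a file for later inspection:
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1 > gpu_mem.log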