python src/train_sft.py \
    --do_train \
    --do_eval False \
    --model_name_or_path $WORK_DIR/hf/THUDM--chatglm-6b \
    --lora_rank 32 \
    --lora_alpha 128 \
    --max_source_length 1200 \
    --max_target_length 800 \
    --preprocessing_num_workers 16 \
    --dataset self_cognition_train \
    --dataset_dir $WORK_DIR/data \
    --cache_dir cache \
    --finetuning_type lora \
    --output_dir llm_ft_ckpt/v7 \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --save_strategy steps \
    --logging_steps 10 \
    --eval_steps 100000 \
    --save_steps 1000 \
    --learning_rate 2e-5 \
    --num_train_epochs 5.0 \
    --plot_loss \
    --fp16 \
    --ddp_find_unused_parameters False
The above parameters run fine. But as soon as I add
--quantization_bit 8
the training errors out.
The stack trace points to bitsandbytes' autograd. Did something get missed, so that gradients are flowing into the main transformer weights?
From my earlier experience, ChatGLM doesn't support the standard HF quantized loading path like load_in_8bit, does it?
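For context on where a bitsandbytes autograd error could come from: a quantized LoRA setup usually has to wrap the base model with peft's prepare_model_for_kbit_training and get_peft_model so that only the adapter weights receive gradients, never the int8 base layers. A minimal sketch, assuming the usual peft pattern; the actual logic inside src/train_sft.py may differ:

import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModel.from_pretrained(
    'THUDM/chatglm-6b',
    load_in_8bit=True,
    trust_remote_code=True)

# Freeze the quantized base weights and enable input gradients, so that
# backprop only updates the LoRA adapters, not the int8 layers.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=32,
    lora_alpha=128,
    target_modules=['query_key_value'],  # assumed attention projection name for ChatGLM-6B
    task_type='CAUSAL_LM')
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()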
The following code does not produce a normal result.
import torch
import transformers

MODEL_PATH = 'THUDM/chatglm-6b'

tokenizer = transformers.AutoTokenizer.from_pretrained(
    MODEL_PATH, trust_remote_code=True)

# Standard HF 8-bit load via bitsandbytes
model = transformers.AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,
    load_in_8bit=True,
    trust_remote_code=True).cuda()
model.eval()

m0 = model.chat(query="你好", tokenizer=tokenizer, max_length=1024)
print(m0)
input("hello")
Inference with load_in_8bit does not work:
from transformers import AutoModel, BitsAndBytesConfig

q_config = BitsAndBytesConfig(
    load_in_8bit=True)
model = AutoModel.from_pretrained(
    MODEL_PATH,
    device_map='auto',
    quantization_config=q_config,
    trust_remote_code=True)

m0 = model.chat(query="你好", tokenizer=tokenizer, max_length=1024)
print(m0)
load_in_4bit, on the other hand, works fine...
q_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16)
model = AutoModel.from_pretrained(
    MODEL_PATH,
    quantization_config=q_config,
    trust_remote_code=True)
# model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
m0 = model.chat(query="你好", tokenizer=tokenizer, max_length=1024)
What GPU are you using? It might simply be that your GPU doesn't support quantized training.
A V100S. This project, https://github.com/shuxueslpi/chatGLM-6B-QLoRA, does run, but it also only uses int4 quantization.
With --quantization_bit 4 it runs fine...
In short: int8 fails, int4 works.
So it looks like ChatGLM and bitsandbytes int8 are incompatible.
It's not that ChatGLM is incompatible with int8; it's that the V100 doesn't support int8 quantization.
What card are you using? I tried it on an A100 and it still fails.
I've tested it on an A100 and it runs fine. Please check your CUDA and dependency library versions.
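For anyone comparing environments, a quick way to dump the relevant versions and GPU info (a minimal sketch; the thread does not state which exact versions are known to work):

import torch, transformers, bitsandbytes, peft

# Print the pieces that usually matter for bitsandbytes int8: library
# versions, the CUDA build of torch, and the GPU's compute capability.
print("torch", torch.__version__, "| cuda", torch.version.cuda)
print("transformers", transformers.__version__)
print("bitsandbytes", bitsandbytes.__version__)
print("peft", peft.__version__)
print("gpu", torch.cuda.get_device_name(0),
      "| capability", torch.cuda.get_device_capability(0))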
I'm... back to share more findings...
It seems it isn't that the card can't run it at all; I'm now testing on an even older P100.
diff a2_0__bnb_q8__not_work.py a2_3__bnb_q8_retry.py
57c57
< llm_int8_threshold=6.0
---
> llm_int8_threshold=4.5
Lowering llm_int8_threshold makes it run.
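Put together, the 8-bit load that worked here looks roughly like this (a sketch based on the snippets above; the exact threshold value that works may depend on the model and the card):

import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = 'THUDM/chatglm-6b'

# Lowering llm_int8_threshold from the default 6.0 treats more activations
# as outliers and routes them through fp16 instead of int8.
q_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=4.5)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_PATH,
    device_map='auto',
    quantization_config=q_config,
    trust_remote_code=True)

m0 = model.chat(query="你好", tokenizer=tokenizer, max_length=1024)
print(m0)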
Play with llm_int8_threshold
You can play with the llm_int8_threshold argument to change the threshold of the outliers. An “outlier” is a hidden state value that is greater than a certain threshold. This corresponds to the outlier threshold for outlier detection as described in LLM.int8() paper. Any hidden states value that is above this threshold will be considered an outlier and the operation on those values will be done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning). This argument can impact the inference speed of the model. We suggest to play with this parameter to find which one is the best for your usecase.
Error message
Normal LoRA training, with just one extra parameter added:
--quantization_bit 8