python src/train_sft.py \
    --do_train \
    --do_eval False \
    --model_name_or_path $WORK_DIR/hf/THUDM--chatglm-6b \
    --lora_rank 32 \
    --lora_alpha 128 \
    --max_source_length 1200 \
    --max_target_length 800 \
    --preprocessing_num_workers 16 \
    --dataset self_cognition_train \
    --dataset_dir $WORK_DIR/data \
    --cache_dir cache \
    --finetuning_type lora \
    --output_dir llm_ft_ckpt/v7 \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --save_strategy steps \
    --logging_steps 10 \
    --eval_steps 100000 \
    --save_steps 1000 \
    --learning_rate 2e-5 \
    --num_train_epochs 5.0 \
    --plot_loss \
    --fp16 \
    --ddp_find_unused_parameters False
The above parameters run fine. But as soon as I add
--quantization_bit 8
the training errors out.
The stack trace points to bitsandbytes' autograd. Did something get missed, so that gradients are flowing into the main transformer weights?
From my earlier experience, ChatGLM doesn't support the standard HF quantized loading path like load_in_8bit, does it?
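For context on where a bitsandbytes autograd error could come from: a quantized LoRA setup usually has to wrap the base model with peft's prepare_model_for_kbit_training and get_peft_model so that only the adapter weights receive gradients, never the int8 base layers. A minimal sketch, assuming the usual peft pattern; the actual logic inside src/train_sft.py may differ:

import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModel.from_pretrained(
    'THUDM/chatglm-6b',
    load_in_8bit=True,
    trust_remote_code=True)

# Freeze the quantized base weights and enable input gradients, so that
# backprop only updates the LoRA adapters, not the int8 layers.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=32,
    lora_alpha=128,
    target_modules=['query_key_value'],  # assumed attention projection name for ChatGLM-6B
    task_type='CAUSAL_LM')
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()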
The following code does not produce a normal result.
import torch
import transformers

MODEL_PATH = 'THUDM/chatglm-6b'

tokenizer = transformers.AutoTokenizer.from_pretrained(
    MODEL_PATH, trust_remote_code=True)

# Standard HF 8-bit load via bitsandbytes
model = transformers.AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,
    load_in_8bit=True,
    trust_remote_code=True).cuda()
model.eval()

m0 = model.chat(query="你好", tokenizer=tokenizer, max_length=1024)
print(m0)
input("hello")
Inference with load_in_8bit does not work:
from transformers import AutoModel, BitsAndBytesConfig

q_config = BitsAndBytesConfig(
    load_in_8bit=True)
model = AutoModel.from_pretrained(
    MODEL_PATH,
    device_map='auto',
    quantization_config=q_config,
    trust_remote_code=True)

m0 = model.chat(query="你好", tokenizer=tokenizer, max_length=1024)
print(m0)
load_in_4bit, on the other hand, works fine...
q_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16)
model = AutoModel.from_pretrained(
    MODEL_PATH,
    quantization_config=q_config,
    trust_remote_code=True)
# model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
m0 = model.chat(query="你好", tokenizer=tokenizer, max_length=1024)
What GPU are you using? It might simply be that your GPU doesn't support quantized training.
A V100S. This project, https://github.com/shuxueslpi/chatGLM-6B-QLoRA, does run, but it also only uses int4 quantization.
With --quantization_bit 4 it runs fine...
In short: int8 fails, int4 works.
So it looks like ChatGLM and bitsandbytes int8 are incompatible.
It's not that ChatGLM is incompatible with int8; it's that the V100 doesn't support int8 quantization.
What card are you using? I tried it on an A100 and it still fails.
I've tested it on an A100 and it runs fine. Please check your CUDA and dependency library versions.
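For anyone comparing environments, a quick way to dump the relevant versions and GPU info (a minimal sketch; the thread does not state which exact versions are known to work):

import torch, transformers, bitsandbytes, peft

# Print the pieces that usually matter for bitsandbytes int8: library
# versions, the CUDA build of torch, and the GPU's compute capability.
print("torch", torch.__version__, "| cuda", torch.version.cuda)
print("transformers", transformers.__version__)
print("bitsandbytes", bitsandbytes.__version__)
print("peft", peft.__version__)
print("gpu", torch.cuda.get_device_name(0),
      "| capability", torch.cuda.get_device_capability(0))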
I'm... back to share more findings...
It seems it isn't that the card can't run it at all; I'm now testing on an even older P100.
diff a2_0__bnb_q8__not_work.py a2_3__bnb_q8_retry.py
57c57
< llm_int8_threshold=6.0
---
> llm_int8_threshold=4.5
Lowering llm_int8_threshold makes it run.
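Put together, the 8-bit load that worked here looks roughly like this (a sketch based on the snippets above; the exact threshold value that works may depend on the model and the card):

import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = 'THUDM/chatglm-6b'

# Lowering llm_int8_threshold from the default 6.0 treats more activations
# as outliers and routes them through fp16 instead of int8.
q_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=4.5)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_PATH,
    device_map='auto',
    quantization_config=q_config,
    trust_remote_code=True)

m0 = model.chat(query="你好", tokenizer=tokenizer, max_length=1024)
print(m0)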
Play with llm_int8_threshold
You can play with the llm_int8_threshold argument to change the threshold of the outliers. An “outlier” is a hidden state value that is greater than a certain threshold. This corresponds to the outlier threshold for outlier detection as described in LLM.int8() paper. Any hidden states value that is above this threshold will be considered an outlier and the operation on those values will be done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning). This argument can impact the inference speed of the model. We suggest to play with this parameter to find which one is the best for your usecase.
Error message
Normal LoRA training, with just one extra parameter added:
--quantization_bit 8