Training and inference templates are inconsistent
Where should the inference template be configured in the Python script? Training command:
torchrun $DISTRIBUTED_ARGS src/train.py \
--deepspeed $DS_CONFIG_PATH \
--stage sft \
--do_train \
--use_fast_tokenizer \
--flash_attn "auto" \
--model_name_or_path $MODEL_PATH \
--dataset $DATASET_NAME \
--template qwen \
--finetuning_type lora \
--lora_target all \
--output_dir $OUTPUT_PATH \
--overwrite_cache \
--overwrite_output_dir \
--warmup_steps 100 \
--weight_decay 0.1 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--ddp_timeout 9000 \
--learning_rate 5e-5 \
--lr_scheduler_type cosine \
--logging_steps 1 \
--cutoff_len 32768 \
--save_steps 1000 \
--plot_loss \
--num_train_epochs 3 \
--bf16
Loading the model:
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
Inference Python script:
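# (message1, device and max_gen are assumed to be defined earlier in the script)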
text1 = tokenizer.apply_chat_template(
message1,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text1], return_tensors="pt").to(device)
generated_ids = model.generate(
model_inputs.input_ids,
max_new_tokens=max_gen
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
For Python inference you need to set the EOS token.
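For example (a hedged sketch, assuming a Qwen-style tokenizer where `<|im_end|>` terminates a turn in addition to the regular EOS token; recent transformers versions accept a list for eos_token_id):
# Hedged sketch: stop generation on either terminator.
eos_token_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|im_end|>"),
]
generated_ids = model.generate(
    model_inputs.input_ids,
    attention_mask=model_inputs.attention_mask,
    max_new_tokens=max_gen,
    eos_token_id=eos_token_ids,
)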
It turns out the cause is that training rewrites the chat_template in the saved tokenizer_config.json. Pretrained tokenizer_config.json:
{"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"}
SFT tokenizer_config.json:
{"chat_template": "{% set system_message = 'You are a helpful assistant.' %}{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% endif %}{% if system_message is defined %}{{ '<|im_start|>system\n' + system_message + '<|im_end|>\n' }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|im_start|>user\n' + content + '<|im_end|>\n<|im_start|>assistant\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<|im_end|>' + '\n' }}{% endif %}{% endfor %}"}
When the pretrained tokenizer is loaded for inference, the results are normal:
tokenizer = AutoTokenizer.from_pretrained(pretrained_tokenizer_path)
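A hedged workaround along the same lines (not an official LLaMA-Factory recipe): load the fine-tuned tokenizer but copy the pretrained chat_template back onto it before calling apply_chat_template:
# Hedged workaround sketch: keep the fine-tuned tokenizer but restore the
# pretrained chat_template so prompt construction matches the base model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
tokenizer.chat_template = AutoTokenizer.from_pretrained(
    pretrained_tokenizer_path
).chat_template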
After loading the fine-tuned model, inference is still wrong even when using get_template_and_fix_tokenizer:
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
get_template_and_fix_tokenizer(tokenizer, "qwen")
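As a hedged diagnostic, one can check what that call actually changed; in LLaMA-Factory 0.8.x the function appears to patch the eos/pad tokens and possibly the jinja chat_template, but this is inferred, not documented:
# Hedged diagnostic: see whether get_template_and_fix_tokenizer replaced the
# saved chat_template (behavior assumed from LLaMA-Factory 0.8.x sources).
before = tokenizer.chat_template
template = get_template_and_fix_tokenizer(tokenizer, "qwen")
print("eos:", tokenizer.eos_token, "pad:", tokenizer.pad_token)
print("chat_template replaced:", before != tokenizer.chat_template)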
Is this a bug?
Are the token IDs different after encoding?
Both the text after apply_chat_template and the token IDs after encoding differ. Code:
tokenizer_sft = AutoTokenizer.from_pretrained(sft_path)
tokenizer_pretrained = AutoTokenizer.from_pretrained(pretrained_path)
prompt = "你是对话判断助手"
messages = [
{"role": "system", "content": prompt}
]
text_sft = tokenizer_sft.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
print(text_sft)
model_inputs = tokenizer_sft([text_sft], return_tensors="pt").to(device)
print(model_inputs)
text_pretrained = tokenizer_pretrained.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
print(text_pretrained)
model_inputs = tokenizer_pretrained([text_pretrained], return_tensors="pt").to(device)
print(model_inputs)
Output:
<|im_start|>system
你是对话判断助手<|im_end|>
{'input_ids': tensor([[151644, 8948, 198, 105043, 105051, 104317, 110498, 151645, 198]],
device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
<|im_start|>system
你是对话判断助手<|im_end|>
<|im_start|>assistant
{'input_ids': tensor([[151644, 8948, 198, 105043, 105051, 104317, 110498, 151645, 198,
151644, 77091, 198]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
It looks like you didn't provide a user message, so the encoding is wrong.
After providing a user message the problem is solved, thank you very much!
prompt = "请帮我判断对话是否清晰"
messages = [
{"role": "system", "content": "你是对话判断助手"},
{"role": "user", "content": prompt}
]
···
Output:
<|im_start|>system
你是对话判断助手<|im_end|>
<|im_start|>user
请帮我判断对话是否清晰<|im_end|>
<|im_start|>assistant
{'input_ids': tensor([[151644, 8948, 198, 105043, 105051, 104317, 110498, 151645, 198,
151644, 872, 198, 14880, 108965, 104317, 105051, 64471, 104542,
151645, 198, 151644, 77091, 198]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
device='cuda:0')}
<|im_start|>system
你是对话判断助手<|im_end|>
<|im_start|>user
请帮我判断对话是否清晰<|im_end|>
<|im_start|>assistant
{'input_ids': tensor([[151644, 8948, 198, 105043, 105051, 104317, 110498, 151645, 198,
151644, 872, 198, 14880, 108965, 104317, 105051, 64471, 104542,
151645, 198, 151644, 77091, 198]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
device='cuda:0')}
System Info
llamafactory version: 0.8.3.dev0
Reproduction
Model used:
Expected behavior
After merging, the model loaded via AutoModelForCausalLM.from_pretrained() is expected to perform binary classification on the input, outputting only "清晰" (clear) or "模糊" (vague).
Others
The output randomly contains an "assistant:" prefix, separated from the result by a space or a newline "\n" at random.
With the same data and the same code, loading the pretrained model Qwen2-7B-Instruct produces correctly formatted output. Where might the problem lie? Thanks!
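A hedged sanity check related to the symptom above: with the Qwen ChatML format, the rendered prompt should end with the assistant header; if it does not, the model tends to emit the "assistant" prefix itself:
# Minimal sanity check (assumes the Qwen ChatML format): the prompt should end
# with the assistant header so the model does not generate "assistant:" itself.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
assert text.endswith("<|im_start|>assistant\n"), "generation prompt is missing"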