intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Baichuan2-13B with bigdl-bf16 does not apply greedy_search when calling model.generate #9948

Open Uxito-Ada opened 10 months ago

Uxito-Ada commented 10 months ago

This is a bigdl-bf16 model, where model_path points to a Baichuan2-13B-Chat checkpoint:

import torch
from bigdl.llm.transformers import AutoModelForCausalLM

# load the bigdl-bf16 model
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             optimize_model=True,
                                             torch_dtype=torch.bfloat16,
                                             load_in_low_bit="bf16",
                                             trust_remote_code=True,
                                             use_cache=True)
# inference
original_output = model.generate(input_ids=input_ids,
                                 use_cache=False,
                                 max_new_tokens=args.n_predict,
                                 do_sample=False)

It is found that greedy_search is not called as expected when using the model.generate API.
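One way to confirm which decoding path generate dispatches to is to trace the call directly. Below is a minimal sketch, not part of the original report, assuming a transformers version that still exposes GenerationMixin.greedy_search and reusing the model object loaded above:

# Shadow the bound greedy_search on this model instance so any call to it is logged.
_orig_greedy_search = model.greedy_search

def _traced_greedy_search(*args, **kwargs):
    print("greedy_search was called")
    return _orig_greedy_search(*args, **kwargs)

model.greedy_search = _traced_greedy_search

# Re-run model.generate(...) as above and check whether the message is printed.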

By contrast, bigdl-int4 calls greedy_search as expected with the same style of API call:

draft_model = AutoModel.from_pretrained(model_path,
                                        load_in_4bit=True,
                                        optimize_model=True,
                                        trust_remote_code=True,
                                        use_cache=True)
draft_output = draft_model.generate(input_ids=input_ids,
                                    use_cache=True,
                                    max_new_tokens=args.n_predict,
                                    do_sample=False)
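For context, with do_sample=False and the default num_beams=1, transformers' generate is expected to take the greedy decoding path in both cases. A minimal sketch, not from the original report, that makes the greedy settings explicit through a GenerationConfig (assuming a transformers version that provides it):

from transformers import GenerationConfig

# Explicit greedy decoding: no sampling, a single beam.
greedy_config = GenerationConfig(do_sample=False,
                                 num_beams=1,
                                 max_new_tokens=args.n_predict)

output = model.generate(input_ids=input_ids,
                        generation_config=greedy_config,
                        use_cache=False)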

This ultimately influences the outputs: bigdl-int4 and ipex-bf16 both use greedy_search and therefore give closer answers, while bigdl-bf16 shows a noticeable difference:

| dtype | Output Tokens |
| --- | --- |
| bigdl-int4 | 1. 在创建栅格折痕的过程中,折叠不对成可能会导致整个折纸过程失败。这种错误可能会像蝴蝶效应一样,导致最终折叠出的玫瑰花形状不准确或无法成形。2. 在制作立体基座的过程中,高台的布局不仅要考虑长和宽这两个维度上的规整衬度和对称分布,还需要同时保证高这个维度上的整齐。如果高台布局不符合要求,可能会导致玫瑰花形状不准确或无法成形。3. 在完成花瓣修饰的阶段,如果花瓣形状没有接近自然中的玫瑰花瓣外形,可能会导致最终折叠出的玫瑰花不够逼真。此外,如果调整花瓣形状的力道控制不当, |
| ipex-bf16 | 折纸失败的原因可能有很多,但根据上述描述,以下几个步骤如果做错了很大可能会导致最终折叠失败:1. 创建栅格折痕:如果折叠过程中出现了折叠不对成的情况,可能会导致折纸失败。这种情况可能会像蝴蝶效应一样,一开始只是毫厘之差,最后可能就是天壤之别。2. 制作立体基座:在这个阶段,如果高台的布局没有考虑长和宽这两个维度上的规整衬度和对称分布,以及高这个维度上的整齐,可能会导致折纸失败。3. 完成花瓣修饰:在这个阶段,如果花瓣的形状没有通过自然的曲线去不断 |
| bigdl-bf16 | 首先,在创建栅格折痕的过程中,如果出现折叠不对成的情况,可能会导致最终的折叠失败。这是因为折叠不对成可能会影响到后续的立体基座制作,甚至可能导致整个折纸过程的混乱。其次,在制作立体基座的过程中,如果高台的布局没有考虑到长、宽、高三个维度上的整齐和对称分布,也可能会导致最终的折叠失败。这是因为高台的布局直接影响到花瓣的形状和排列,从而影响整个玫瑰花的形状。最后,在完成花瓣修饰的阶段,如果没有充分理解大自然中玫瑰花的外形,并借助自然的曲线去不断修正花瓣的形状,也可能导致最终的折叠失败。这是因为花瓣的形状直接 |

Hope the bigdl-bf16 service owner can help fix this, please.

rnwang04 commented 10 months ago

I can't reproduce this issue. Based on my test, the code below does call self.greedy_search.

code

import torch
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# load
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             optimize_model=True,
                                             torch_dtype=torch.bfloat16,
                                             load_in_low_bit="bf16",
                                             trust_remote_code=True,
                                             use_cache=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

input_str = "tell me a story"
input_ids = tokenizer.encode(input_str, return_tensors="pt")

# inference
original_output = model.generate(input_ids=input_ids,
                                 use_cache=False,
                                 max_new_tokens=13,
                                 do_sample=False)
output_str = tokenizer.decode(original_output[0], skip_special_tokens=True)
print(original_output)
print(output_str)
Uxito-Ada commented 10 months ago

Is import intel_extension_for_pytorch as ipex necessary? The import does some initialization work. @rnwang04

rnwang04 commented 10 months ago

> Is import intel_extension_for_pytorch as ipex necessary? The import does some initialization work. @rnwang04

It's not necessary; I used ipex here because I validated in a GPU conda env. I have double-checked in a CPU conda env and confirm that it does use greedy_search. I also found that our bf16 gives the same output as native bf16 in the CPU env.
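For reference, a minimal sketch of such a comparison, with plain transformers AutoModelForCausalLM assumed as the native bf16 baseline and reusing model_path, input_ids, and original_output from the snippet above:

import torch
from transformers import AutoModelForCausalLM as NativeAutoModelForCausalLM

# Native (non-bigdl) bf16 baseline on CPU.
native_model = NativeAutoModelForCausalLM.from_pretrained(model_path,
                                                          torch_dtype=torch.bfloat16,
                                                          trust_remote_code=True)

native_output = native_model.generate(input_ids=input_ids,
                                      use_cache=False,
                                      max_new_tokens=13,
                                      do_sample=False)

# Token-level comparison against the bigdl-bf16 output from above.
print(torch.equal(original_output, native_output))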