intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc

Model output is different when using default optimize_model #10782

Open vishnumadhu365 opened 7 months ago

vishnumadhu365 commented 7 months ago

While testing ipex-llm, I observed a difference in model output after calling optimize_model(), which defaults to sym_int4. Please help clarify the following:

  1. What is causing this variation in output?
  2. Does the optimize_model() call ensure that model accuracy remains the same across eval benchmarks like HumanEval, MMLU, etc.?

Thanks!

env: Python 3.9, ipex-llm 2.1.0b20240416, torch 2.2.2, transformers 4.31.0

reproducer:

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["TRANSFORMERS_VERBOSITY"] = "error"

import sys
import warnings
warnings.filterwarnings("ignore")

import torch
torch.manual_seed(100)

from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = 'meta-llama/Llama-2-7b-chat-hf'
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             use_cache=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, 
                                           trust_remote_code=True)

system_prompt = "You are a creative poet. Write a poem about the given topic. Use only 100 words"
user_prompt = "Write a poem about owls and starry nights"
prompt_template = f"<s>[INST] <<SYS>>\n {system_prompt} \n<</SYS>>\n\n {user_prompt}  [/INST]"

print("*"*10 + "Original model output" + "*"*10)
print(tokenizer.decode(model.generate(tokenizer.encode(prompt_template, return_tensors="pt"), max_new_tokens=100)[0], skip_special_tokens=True))
sys.stdout.flush()

# apply ipex-llm optimization; with no arguments this defaults to sym_int4 weight quantization
from ipex_llm import optimize_model
model = optimize_model(model)

print("*"*10 + "IPEX-LLM Optimized model output" + "*"*10)
print(tokenizer.decode(model.generate(tokenizer.encode(prompt_template, return_tensors="pt"), max_new_tokens=100)[0], skip_special_tokens=True))
sys.stdout.flush()

output:

**********Original model output**********
[INST] <<SYS>>
 You are a creative poet. Write a poem about the given topic. Use only 100 words 
<</SYS>>

 Write a poem about owls and starry nights  [/INST]  Sure! Here is a 100-word poem about owls and starry nights:

Silent sentinels of the night,
Owls perch on boughs, their eyes alight.
Glittering stars above, a twinkling sight,
A magical night, pure delight.
Converting the current model to sym_int4 format......
**********IPEX-LLM Optimized model output**********
[INST] <<SYS>>
 You are a creative poet. Write a poem about the given topic. Use only 100 words 
<</SYS>>

 Write a poem about owls and starry nights  [/INST]  Sure, here is a poem about owls and starry nights in exactly 100 words:

Owls hoot in the night's embrace
Their soft coos echo through space
While stars twinkle bright and slow
A celestial show to know
Nature's symphony so grand
In this peaceful night's command
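
As a side note, one way to quantify the divergence instead of comparing the generated poems by eye is to capture the next-token logits before and after optimize_model() on the same prompt. The sketch below reuses the model, tokenizer, and prompt_template from the reproducer above and assumes the model has not yet been optimized; the comparison itself is illustrative and not part of ipex-llm.

import torch
from ipex_llm import optimize_model

# encode once and reuse the same inputs for both forward passes
inputs = tokenizer(prompt_template, return_tensors="pt")

# next-token logits of the original FP32 model
with torch.inference_mode():
    ref_logits = model(**inputs).logits[0, -1].clone()

# optimize (sym_int4 by default), then repeat the same forward pass
model = optimize_model(model)
with torch.inference_mode():
    opt_logits = model(**inputs).logits[0, -1]

print("max |logit diff|:", (ref_logits - opt_logits).abs().max().item())
print("same top-1 token:", ref_logits.argmax().item() == opt_logits.argmax().item())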
hkvision commented 7 months ago

Hi,

We apply some further optimizations in ipex-llm for optimal performance, which may change some logits and outputs; this is expected. At the same time, we run accuracy benchmarks (e.g. the tasks in https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) to make sure that our optimizations don't have any obvious negative impact on accuracy. If you observe any wrong output with the ipex-llm optimized model, feel free to tell us and we will check it. Thanks!
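
If the sym_int4 default turns out to be too lossy for a particular workload, a less aggressive precision can be requested when optimizing. This is only a sketch based on the low_bit argument described in the ipex-llm documentation; the exact list of supported values (e.g. "sym_int8", "fp8") should be checked against the installed version.

from ipex_llm import optimize_model

# assumes a freshly loaded HF model, not one already passed through optimize_model()
# "sym_int8" is used here as an example of a higher-precision format than the sym_int4 default
model = optimize_model(model, low_bit="sym_int8")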