intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

Model output is different when using default optimize_model #10782

Open vishnumadhu365 opened 5 months ago

vishnumadhu365 commented 5 months ago

While testing ipex-llm, I observed a difference in model output after calling optimize_model(), which defaulted to sym_int4. Please help clarify the following:

  1. What is causing this variation in output?
  2. Does the optimize_model() call ensure that model accuracy remains the same across eval benchmarks such as HumanEval, MMLU, etc.?

Thanks!

env: Python 3.9, ipex-llm 2.1.0b20240416, torch 2.2.2, transformers 4.31.0

reproducer:

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["TRANSFORMERS_VERBOSITY"] = "error"

import sys
import warnings
warnings.filterwarnings("ignore")

import torch
torch.manual_seed(100)

from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = 'meta-llama/Llama-2-7b-chat-hf'
model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 trust_remote_code=True,
                                                 use_cache=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, 
                                           trust_remote_code=True)

system_prompt = "You are a creative poet. Write a poem about the given topic. Use only 100 words"
user_prompt = "Write a poem about owls and starry nights"
prompt_template = f"<s>[INST] <<SYS>>\n {system_prompt} \n<</SYS>>\n\n {user_prompt}  [/INST]"

print("*"*10 + "Original model output" + "*"*10)
print(tokenizer.decode(model.generate(tokenizer.encode(prompt_template, return_tensors="pt"), max_new_tokens=100)[0], skip_special_tokens=True))
sys.stdout.flush()

from ipex_llm import optimize_model
model = optimize_model(model)

print("*"*10 + "IPEX-LLM Optimized model output" + "*"*10)
print(tokenizer.decode(model.generate(tokenizer.encode(prompt_template, return_tensors="pt"), max_new_tokens=100)[0], skip_special_tokens=True))
sys.stdout.flush()

output:

**********Original model output**********
[INST] <<SYS>>
 You are a creative poet. Write a poem about the given topic. Use only 100 words 
<</SYS>>

 Write a poem about owls and starry nights  [/INST]  Sure! Here is a 100-word poem about owls and starry nights:

Silent sentinels of the night,
Owls perch on boughs, their eyes alight.
Glittering stars above, a twinkling sight,
A magical night, pure delight.
Converting the current model to sym_int4 format......
**********IPEX-LLM Optimized model output**********
[INST] <<SYS>>
 You are a creative poet. Write a poem about the given topic. Use only 100 words 
<</SYS>>

 Write a poem about owls and starry nights  [/INST]  Sure, here is a poem about owls and starry nights in exactly 100 words:

Owls hoot in the night's embrace
Their soft coos echo through space
While stars twinkle bright and slow
A celestial show to know
Nature's symphony so grand
In this peaceful night's command
hkvision commented 5 months ago

Hi,

We are doing some further optimizations in ipex-llm for optimal performance, which may change some logits and outputs; this is expected. At the same time, we run accuracy benchmarks (e.g. the tasks in https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) to make sure that our optimizations do not have any obvious negative impact on accuracy. If you observe any incorrect output from the ipex-llm optimized model, feel free to let us know and we will check it. Thanks!
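
If you'd like to gauge how much of the difference comes from the default sym_int4 precision, below is a minimal sketch that re-runs your prompt after optimize_model at a couple of low-bit settings. It assumes optimize_model's low_bit argument accepts values such as "sym_int8" (please check the ipex-llm docs for the exact list) and uses greedy decoding so the runs are directly comparable:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from ipex_llm import optimize_model

model_path = 'meta-llama/Llama-2-7b-chat-hf'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

system_prompt = "You are a creative poet. Write a poem about the given topic. Use only 100 words"
user_prompt = "Write a poem about owls and starry nights"
prompt_template = f"<s>[INST] <<SYS>>\n {system_prompt} \n<</SYS>>\n\n {user_prompt}  [/INST]"
input_ids = tokenizer.encode(prompt_template, return_tensors="pt")

for low_bit in ("sym_int4", "sym_int8"):  # assumed-valid low_bit values
    # Reload a fresh FP32 copy each time so the low-bit conversions don't stack.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 trust_remote_code=True,
                                                 use_cache=True)
    model = optimize_model(model, low_bit=low_bit)
    with torch.inference_mode():
        output = model.generate(input_ids, max_new_tokens=100, do_sample=False)
    print("*" * 10 + f" low_bit={low_bit} output " + "*" * 10)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Higher-precision settings should track the original FP32 output more closely, while the lower-bit settings trade some of that fidelity for speed and memory.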