intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Evaluation of whether MiniCPM-2B-sft-bf16 needs model-based optimization in ipex-llm #11163

Open wluo1007 opened 4 months ago

wluo1007 commented 4 months ago

Below are the benchmark results on both THUDM/chatglm3-6b and openbmb/MiniCPM-2B-sft-bf16, which show that chatglm3-6b has better rest-token throughput than MiniCPM-2B. Since MiniCPM-2B is a 2B model while chatglm3-6b is a 6B model, I'm not sure whether these results are normal or whether further optimization should be done for MiniCPM-2B.

Platform: Core Ultra 7 165H, 32GB*2 = 64GB DDR5 5600 MT/s, Ubuntu 22.04, ipex-llm 2.1.0b20240526


| Model | Data type | Batch | Input | Output | First token latency (ms) | Rest token latency (ms/token) |
| --- | --- | --- | --- | --- | --- | --- |
| THUDM/chatglm3-6b | INT4-SYM | 1 | 32 | 32 | 468.38 | 46.98 |
| THUDM/chatglm3-6b | INT4-SYM | 1 | 1024 | 128 | 3997.21 | 48.87 |
| THUDM/chatglm3-6b | INT4-SYM | 1 | 1024 | 1024 | 3987.13 | 49.31 |
| THUDM/chatglm3-6b | INT4-SYM | 1 | 2048 | 1024 | 8607.79 | 50.25 |
| openbmb/MiniCPM-2B-sft-bf16 | INT4-SYM | 1 | 32 | 32 | 258.62 | 44.72 |
| openbmb/MiniCPM-2B-sft-bf16 | INT4-SYM | 1 | 1024 | 128 | 2602.00 | 65.81 |
| openbmb/MiniCPM-2B-sft-bf16 | INT4-SYM | 1 | 1024 | 1024 | 2720.03 | 80.40 |
| openbmb/MiniCPM-2B-sft-bf16 | INT4-SYM | 1 | 2048 | 1024 | 6910.26 | 112.05 |
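
For context, a minimal sketch of how such first/rest token latencies can be measured with ipex-llm's transformers-style API. The `sym_int4` low-bit load corresponds to the INT4-SYM rows above; the prompt and token counts are placeholders, the snippet assumes an XPU-enabled ipex-llm install (so `torch.xpu` is available), and in practice a warm-up run should precede timing:

```python
import time
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "openbmb/MiniCPM-2B-sft-bf16"  # or "THUDM/chatglm3-6b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Load the weights quantized to symmetric INT4 (the "INT4-SYM" rows above).
model = AutoModelForCausalLM.from_pretrained(
    model_path, load_in_low_bit="sym_int4", trust_remote_code=True
).to("xpu")  # run on the Intel iGPU/dGPU via the XPU backend

prompt = "..."  # placeholder; pad/truncate to the desired input length (e.g. 1024 tokens)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    # First token latency: a single-step generate covers the whole prefill.
    t0 = time.perf_counter()
    model.generate(input_ids, max_new_tokens=1)
    torch.xpu.synchronize()  # wait for queued XPU kernels before reading the clock
    first_ms = (time.perf_counter() - t0) * 1000

    # Rest token latency: amortize a longer generation over the remaining tokens.
    n = 128
    t0 = time.perf_counter()
    model.generate(input_ids, max_new_tokens=n)
    torch.xpu.synchronize()
    total_ms = (time.perf_counter() - t0) * 1000
    rest_ms = (total_ms - first_ms) / (n - 1)

print(f"first token: {first_ms:.2f} ms, rest: {rest_ms:.2f} ms/token")
```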

qiuxin2012 commented 4 months ago

We have done some optimizations for MiniCPM-2B-sft-bf16; could you try the latest ipex-llm 2.1.0b20240603?

wluo1007 commented 3 months ago

Platform: Core Ultra 7 165H, 32GB*2 = 64GB DDR5 5600 MT/s, Ubuntu 22.04, ipex-llm 2.1.0b20240606. Compared to the previous version, the performance improvement is obvious.


| model | 1st token avg latency (ms) | 2+ avg latency (ms/token) | input/output tokens | batch_size | low_bit |
| --- | --- | --- | --- | --- | --- |
| openbmb/MiniCPM-2B-sft-bf16 | 246.71 | 24.11 | 32-32 | 1 | sym_int4 |
| openbmb/MiniCPM-2B-sft-bf16 | 2493.34 | 27.85 | 1024-128 | 1 | sym_int4 |
| openbmb/MiniCPM-2B-sft-bf16 | 2626.22 | 29.96 | 1024-1024 | 1 | sym_int4 |
| openbmb/MiniCPM-2B-sft-bf16 | 6618.25 | 34.09 | 2048-1024 | 1 | sym_int4 |
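
As a quick sanity check on the gain, converting the 32-32 rest-token latencies from the two tables into decode throughput (plain arithmetic on the reported numbers):

```python
# tokens/s = 1000 / (rest-token latency in ms/token), using the 32-32 rows above
for version, ms in [("2.1.0b20240526", 44.72), ("2.1.0b20240606", 24.11)]:
    print(f"{version}: {1000 / ms:.1f} tokens/s")  # ~22.4 -> ~41.5 tokens/s
```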