intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc

Questions about performance gap between benchmark scripts and llama-bench from ipex-llm[cpp] #12280

Closed. acane77 closed this issue 5 days ago

acane77 commented 1 week ago

Background

We evaluated performance with llama-bench from ipex-llm[cpp] and with the benchmark script, in order to compare against the benchmark results from this image.

We found that the benchmark script, which uses the transformers pipeline and the PyTorch backend, achieves better performance than llama-bench. This is surprising, because llama-bench measures prefill and decode speed separately and does no sampling at all during decoding, so it should be faster than a normal LLM generate pipeline.
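
For reference, the measurement that the benchmark script reports can be approximated with a minimal sketch along the following lines. This is not the actual script: the model path, prompt, and token counts are placeholders, and it assumes ipex-llm's transformers-style API with the model on the xpu device.

# Minimal sketch (not the actual ipex-llm benchmark script): measure 1st-token latency
# (prompt processing + first token) and the average latency of the remaining tokens.
import time
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (makes torch.xpu available)
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "microsoft/Phi-3-mini-128k-instruct"   # placeholder
prompt = "Once upon a time, " * 64                  # placeholder prompt

model = AutoModelForCausalLM.from_pretrained(
    model_path, load_in_low_bit="sym_int4", trust_remote_code=True
).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    model.generate(input_ids, max_new_tokens=32, do_sample=False)   # warm-up
    torch.xpu.synchronize()

    t0 = time.perf_counter()
    model.generate(input_ids, max_new_tokens=1, do_sample=False)    # prefill + 1 token
    torch.xpu.synchronize()
    t1 = time.perf_counter()

    model.generate(input_ids, max_new_tokens=128, do_sample=False)  # prefill + 128 tokens
    torch.xpu.synchronize()
    t2 = time.perf_counter()

first_token_ms = (t1 - t0) * 1000
# rough 2+ latency: subtract one prefill pass, average over the remaining 127 tokens
next_token_ms = ((t2 - t1) - (t1 - t0)) * 1000 / 127
print(f"1st token: {first_token_ms:.2f} ms, 2+: {next_token_ms:.2f} ms/token")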

We ran the benchmarks on Ubuntu 22.04 with an Intel Core Ultra 7 155H.

The steps and our results

The results from llama-bench (the original, unmodified version):

./llama-bench -m model.gguf -n 128 -p 365,876,3376 -t 16 -ub 2048 -b 2048 -r 5  
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
| model                          |       size |     params | backend    | ngl | threads | n_ubatch |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | ------------: | ---------------: |
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|    1.5|    128|    1024|   32| 30655M|            1.3.30872|
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | SYCL       |  99 |      16 |     2048 |         pp365 |    438.78 ± 4.06 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | SYCL       |  99 |      16 |     2048 |         pp876 |    563.76 ± 9.62 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | SYCL       |  99 |      16 |     2048 |        pp3376 |    418.48 ± 2.02 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | SYCL       |  99 |      16 |     2048 |         tg128 |     26.42 ± 0.32 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | SYCL       |  99 |      16 |     2048 |         tg256 |     26.38 ± 0.08 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | SYCL       |  99 |      16 |     2048 |         tg512 |     25.35 ± 0.89 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | SYCL       |  99 |      16 |     2048 |        tg1024 |     25.17 ± 0.69 |

build: d33728a (1)

As mentioned above, we first made some modifications to llama-bench so that it runs decode immediately after prefill and reports the prefill and decode speeds separately.

The code is here: https://github.com/acane77/llama.cpp/tree/dev_ipex_mod
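
As a side note, a rough Python-side cross-check of the same prefill/decode split can be sketched through the llama-cpp-python bindings (this is only an approximation with placeholder paths, not our llama-bench modification):

# Rough sketch: time the first streamed token (prompt eval + 1 generated token)
# separately from the remaining tokens. Model path and prompt are placeholders.
import time
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_gpu_layers=99, n_ctx=4096, n_batch=2048)

prompt = "..."  # placeholder prompt of roughly the desired token count
n_prompt = len(llm.tokenize(prompt.encode("utf-8")))

t0 = time.perf_counter()
stream = llm(prompt, max_tokens=128, temperature=0.0, stream=True)
next(stream)                         # prompt evaluation + first generated token
t1 = time.perf_counter()
n_rest = sum(1 for _ in stream)      # drain the remaining tokens
t2 = time.perf_counter()

print(f"prefill ~ {n_prompt / (t1 - t0):.1f} t/s")
print(f"decode  ~ {n_rest / (t2 - t1):.1f} t/s")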

We built llama-bench with the following commands:

cmake -B build -DBUILD_SHARED_LIBS=ON -DLLAMA_SYCL=1 -DLLAMA_CLBLAST=1 -DGGML_SYCL=ON -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=icx  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
cmake --install build --prefix install
cp ./install/bin/llama-bench ~/projects/llama-cpp/llama-bench-emb

where ~/projects/llama-cpp/llama-bench-emb is a directory created by llama-init, and the libraries are linked to the ipex-llm venv.

Then we ran this llama-bench-emb (our modified version); the results are as follows.

./llama-bench-emb -m model.gguf -n 128 -p 365,876,3376 -t 16 -ub 2048 -b 2048 -r 5  
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
| model                          |       size |     params | backend    | ngl |    threads |   n_ubatch |          test |    prefill (t/s) |     decode (t/s) |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ------------: | ---------------: | ---------------: |
-- Note: Use embedding as model input  >> found prompt: 365
  >> found prompt: 876
  >> found prompt: 3376
  << found decode: 128
****** Add test: prompt 365   decode: 128
****** Add test: prompt 876   decode: 128
****** Add test: prompt 3376   decode: 128
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|    1.5|    128|    1024|   32| 30655M|            1.3.30872|
----> test case: n_prompt=365, n_gen=128
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | SYCL       |  99 |         16 |       2048 |   pp365+tg128 |  479.02 ± 11.30 |    27.60 ± 0.22 |
----> test case: n_prompt=876, n_gen=128
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | SYCL       |  99 |         16 |       2048 |   pp876+tg128 |   562.92 ± 3.75 |    26.76 ± 0.14 |
----> test case: n_prompt=3376, n_gen=128
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | SYCL       |  99 |         16 |       2048 |  pp3376+tg128 |  419.32 ± 27.55 |    21.75 ± 1.43 |

build: 1da2df74 (3009)

For comparison, the following tables were generated by the ipex-llm benchmark script:

,model,1st token avg latency (ms),2+ avg latency (ms/token),encoder time (ms),input/output tokens,batch_size,actual input/output tokens,num_beams,low_bit,cpu_embedding,model loading time (s),peak mem (GB),streaming,use_fp16_torch_dtype
0,microsoft/Phi-3-mini-128k-instruct,729.85,31.58,0.0,365-128,1,366-128,1,sym_int4,,4.33,2.431640625,N/A,N/A
1,microsoft/Phi-3-mini-128k-instruct,1383.27,32.73,0.0,778-128,1,779-128,1,sym_int4,,4.33,2.576171875,N/A,N/A
2,microsoft/Phi-3-mini-128k-instruct,8095.65,32.73,0.0,3667-128,1,3668-128,1,sym_int4,,4.33,4.080078125,N/A,N/A
,model,1st token avg latency (ms),2+ avg latency (ms/token),encoder time (ms),input/output tokens,batch_size,actual input/output tokens,num_beams,low_bit,cpu_embedding,model loading time (s),peak mem (GB),streaming,use_fp16_torch_dtype
0,microsoft/Phi-3-mini-128k-instruct,241.83,30.39,0.0,32-32,1,33-32,1,sym_int4,,3.92,2.474609375,N/A,N/A
1,microsoft/Phi-3-mini-128k-instruct,1852.87,32.42,0.0,960-64,1,961-64,1,sym_int4,,3.92,2.98046875,N/A,N/A
2,microsoft/Phi-3-mini-128k-instruct,2162.56,32.94,0.0,1024-128,1,1025-128,1,sym_int4,,3.92,2.94921875,N/A,N/A
,model,1st token avg latency (ms),2+ avg latency (ms/token),encoder time (ms),input/output tokens,batch_size,actual input/output tokens,num_beams,low_bit,cpu_embedding,model loading time (s),peak mem (GB),streaming,use_fp16_torch_dtype
0,microsoft/Phi-3-mini-128k-instruct,783.59,30.04,0.0,365-128,1,366-128,1,sym_int4,,3.87,2.681640625,N/A,N/A
1,microsoft/Phi-3-mini-128k-instruct,1614.28,31.18,0.0,778-128,1,779-128,1,sym_int4,,3.87,2.82421875,N/A,N/A
2,microsoft/Phi-3-mini-128k-instruct,12719.86,32.0,0.0,3667-128,1,3668-128,1,sym_int4,,3.87,5.5,N/A,N/A
3,microsoft/Phi-3-mini-128k-instruct,810.81,32.42,0.0,365-128,1,366-128,1,sym_int4,,10.96,2.681640625,N/A,N/A
4,microsoft/Phi-3-mini-128k-instruct,1618.89,33.6,0.0,778-128,1,779-128,1,sym_int4,,10.96,2.8125,N/A,N/A
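
For easier side-by-side reading with the llama-bench columns, the script's latency columns (ms) can be converted to tokens per second; a minimal sketch, assuming the CSV output above is saved to a hypothetical results.csv:

# Sketch: convert "1st token avg latency (ms)" and "2+ avg latency (ms/token)"
# into prefill / decode tokens-per-second. Repeated header rows are skipped.
import csv

with open("results.csv") as f:
    for row in csv.DictReader(f):
        if row["model"] == "model":   # a repeated header line inside the file
            continue
        first_ms = float(row["1st token avg latency (ms)"])
        next_ms = float(row["2+ avg latency (ms/token)"])
        n_prompt = int(row["input/output tokens"].split("-")[0])
        print(f"{row['input/output tokens']}: "
              f"prefill ~ {n_prompt / first_ms * 1000:.1f} t/s, "
              f"decode ~ {1000 / next_ms:.1f} t/s")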

Questions

Is there any reason for this significant performance gap between the Python transformers benchmark and llama-bench?

One difference is that the PyTorch benchmark uses the xpu device, while llama-bench uses the GPU through the SYCL backend.

qiuxin2012 commented 1 week ago

Please make sure your performance data are in the same format. For example, in the Python transformers benchmark, the 1st-token metric is the total time for the first token, and the 2+ metric is milliseconds per token. What is the format of llama-bench? It looks like tokens per second.

qiuxin2012 commented 1 week ago

I just noticed that llama-bench's format is prefill (t/s) | decode (t/s). For 365 tokens, the 1st-token total time is 365 / 479 ≈ 0.762 s, and the 2+ latency is 1000 / 27.6 ≈ 36.23 ms/token. This shows that the Python transformers benchmark is a little faster.
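
The same conversion in code, using the pp365+tg128 row from the modified llama-bench table and the matching 365-128 row from the script output above:

# Convert llama-bench throughput (t/s) into the script's latency metrics (ms).
prefill_tps, decode_tps = 479.02, 27.60          # modified llama-bench, pp365+tg128
n_prompt = 365

first_token_ms = n_prompt / prefill_tps * 1000   # ~762 ms
next_token_ms = 1000 / decode_tps                # ~36.2 ms/token

# The Python transformers benchmark reported ~729.85 ms and ~31.58 ms/token for
# its 365-128 case, i.e. slightly faster on both metrics.
print(f"{first_token_ms:.1f} ms first token, {next_token_ms:.2f} ms/token after")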

acane77 commented 1 week ago

Yes, we noticed this as well. We also tried different configurations (batch size, ubatch size, number of threads), but all of those results are still lower than the transformers results.