intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

tg speed of gemma2 is slower than upstream llama.cpp #11853

Closed ruihe774 closed 2 months ago

ruihe774 commented 2 months ago

Hi. I found that the token generation (tg) speed of gemma2 with the llama.cpp shipped in ipex-llm[cpp] is slower than with upstream llama.cpp. Can it be optimized?

ipex-llm[cpp]:

| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|    1.3|    448|    1024|   32|  8096M|            1.3.27642|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:448
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         pp512 |    791.23 ± 7.36 |
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         tg128 |     16.29 ± 0.45 |

build: f6b084d (1)

upstream:

ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|    1.3|    448|    1024|   32|  8096M|            1.3.27642|
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         pp512 |    642.29 ± 6.25 |
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         tg128 |     20.90 ± 0.17 |

build: cfac111e (3605)
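
To put a number on the gap between the two runs above, the result rows can be parsed and compared. The following is a minimal sketch, not part of llama.cpp or ipex-llm: the helper names `parse_bench` and `percent_change` are hypothetical, and the regex assumes the table layout shown in the logs above.

```python
import re

# A llama-bench result row ends with "| <test> | <mean> ± <stddev> |".
ROW = re.compile(r"^\|.*\|\s*(pp\d+|tg\d+)\s*\|\s*([\d.]+)\s*±\s*[\d.]+\s*\|$")


def parse_bench(output: str) -> dict:
    """Map each test name (e.g. 'tg128') to its mean tokens/s."""
    results = {}
    for line in output.splitlines():
        match = ROW.match(line.strip())
        if match:
            results[match.group(1)] = float(match.group(2))
    return results


def percent_change(ours: dict, upstream: dict) -> dict:
    """Relative difference of `ours` versus `upstream`, in percent."""
    return {
        test: (ours[test] - upstream[test]) / upstream[test] * 100.0
        for test in ours
        if test in upstream
    }


# Rows copied from the two runs above; non-result log lines are skipped.
IPEX_LLM = """
[SYCL] call ggml_init_sycl
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         pp512 |    791.23 ± 7.36 |
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         tg128 |     16.29 ± 0.45 |
"""
UPSTREAM = """
[SYCL] call ggml_check_sycl
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         pp512 |    642.29 ± 6.25 |
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         tg128 |     20.90 ± 0.17 |
"""

changes = percent_change(parse_bench(IPEX_LLM), parse_bench(UPSTREAM))
for test, pct in changes.items():
    print(f"{test}: {pct:+.1f}%")
```

On these numbers, the ipex-llm[cpp] build is roughly 22% slower in tg128 and roughly 23% faster in pp512 than upstream.
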
rnwang04 commented 2 months ago

Hi @ruihe774, we will take a look to see whether we can reproduce this issue. Could you please also share your detailed environment information using our env check script? Also, what command did you use to obtain the results above?

ruihe774 commented 2 months ago

benchmark command:

$ ./llama-bench -m gemma2-9b-q4_0.gguf

The GGUF model file is pulled from https://ollama.com/library/gemma2

env check (I only use ipex-llm[cpp], so ipex-llm[xpu] is not installed):

-----------------------------------------------------------------
PYTHON_VERSION=3.11.9
-----------------------------------------------------------------
transformers=4.44.0
-----------------------------------------------------------------
torch=2.2.0+cpu
-----------------------------------------------------------------
ipex-llm WARNING: Package(s) not found: ipex-llm
-----------------------------------------------------------------
IPEX is not installed. 
-----------------------------------------------------------------
CPU Information: 
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        39 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               12
On-line CPU(s) list:                  0-11
Vendor ID:                            GenuineIntel
Model name:                           12th Gen Intel(R) Core(TM) i5-12490F
CPU family:                           6
Model:                                151
Thread(s) per core:                   2
Core(s) per socket:                   6
Socket(s):                            1
Stepping:                             2
CPU(s) scaling MHz:                   33%
CPU max MHz:                          4600.0000
CPU min MHz:                          800.0000
-----------------------------------------------------------------
Total CPU Memory: 94.1574 GB
-----------------------------------------------------------------
Operating System: 
Ubuntu 24.04 LTS \n \l

-----------------------------------------------------------------
Linux systemd-llm 6.10.4-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Aug 11 15:32:50 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
    Version: 1.2.38.20240718
    Build ID: 0db09695

Service:
    Version: 1.2.38.20240718
    Build ID: 0db09695
    Level Zero Version: 1.16.0
-----------------------------------------------------------------
  Driver Version                                  2023.16.12.0.12_195853.xmain-hotfix
  Driver Version                                  2023.16.12.0.12_195853.xmain-hotfix
  Driver UUID                                     32332e34-332e-3032-3736-343200000000
  Driver Version                                  23.43.027642
-----------------------------------------------------------------
Driver related package version:
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed. 
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Arc(TM) A750 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0003-0000-000856a18086                                       |
|           | PCI BDF Address: 0000:03:00.0                                                        |
|           | DRM Device: /dev/dri/card1                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
lspci: Unable to load libkmod resources: error -2
GPU0 Memory [size=8G
-----------------------------------------------------------------
lspci: Unable to load libkmod resources: error -2
03:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A750] (rev 08) (prog-if 00 [VGA controller])
    Subsystem: Intel Corporation Device 1021
    Flags: bus master, fast devsel, latency 0, IRQ 159
    Memory at 80000000 (64-bit, non-prefetchable) [size=16M]
    Memory at 4000000000 (64-bit, prefetchable) [size=8G]
    Expansion ROM at 81000000 [disabled] [size=2M]
    Capabilities: <access denied>
    Kernel driver in use: i915

-----------------------------------------------------------------
rnwang04 commented 2 months ago

Hi @ruihe774, we can't reproduce this issue on our Linux A750 machine.

upstream

$ ./build/bin/llama-bench -m ~/gemma2-9b-it-q4_0.gguf
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|    1.3|    448|    1024|   32|  8096M|            1.3.27191|
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         pp512 |    664.75 ± 1.96 |
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         tg128 |     16.33 ± 0.07 |

build: cfac111e (3605)

ours

$ ./llama-bench -m ~/gemma2-9b-it-q4_0.gguf 
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|    1.3|    448|    1024|   32|  8096M|            1.3.27191|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:448
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         pp512 |   1211.28 ± 3.67 |
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         tg128 |     17.21 ± 0.10 |

build: f6b084d (1)

and here is the info of our machine:

env-check.sh

$ ./env-check.sh 
-----------------------------------------------------------------
PYTHON_VERSION=3.11.9
-----------------------------------------------------------------
transformers=4.44.0
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm ./env-check.sh: /home/yons/miniforge3/envs/qiyue-cpp-0820/bin/pip: /home/arda/miniforge3/envs/qiyue-cpp-0820/bin/python3.11: bad interpreter: No such file or directory
-----------------------------------------------------------------
IPEX is not installed. 
-----------------------------------------------------------------
CPU Information: 
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             28
On-line CPU(s) list:                0-27
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Core(TM) i7-14700K
CPU family:                         6
Model:                              183
Thread(s) per core:                 2
Core(s) per socket:                 20
Socket(s):                          1
Stepping:                           1
CPU max MHz:                        5600.0000
CPU min MHz:                        800.0000
BogoMIPS:                           6835.20
-----------------------------------------------------------------
Total CPU Memory: 62.5534 GB
Memory Type: DDR5-A1 
-----------------------------------------------------------------
Operating System: 
Ubuntu 22.04.3 LTS \n \l

-----------------------------------------------------------------
Linux yons-B760M-AORUS-ELITE-AX 6.5.0-25-generic #25~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Feb 20 16:09:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
    Version: 1.2.22.20231126
    Build ID: 00000000

Service:
    Version: 1.2.22.20231126
    Build ID: 00000000
    Level Zero Version: 1.14.0
-----------------------------------------------------------------
  Driver Version                                  2023.16.12.0.12_195853.xmain-hotfix
  Driver Version                                  2023.16.12.0.12_195853.xmain-hotfix
  Driver UUID                                     32332e33-352e-3237-3139-312e34320000
  Driver Version                                  23.35.27191.42
  Driver UUID                                     32332e33-352e-3237-3139-312e34320000
  Driver Version                                  23.35.27191.42
-----------------------------------------------------------------
Driver related package version:
rc  intel-fw-gpu                                   2023.39.2-255~22.04                     all          Firmware package for Intel integrated and discrete GPUs
ii  intel-level-zero-gpu                           1.3.27191.42-775~22.04                  amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
igpu detected
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 OpenCL 3.0 NEO  [23.35.27191.42]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.27191]
-----------------------------------------------------------------
xpu-smi is properly installed. 
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) UHD Graphics 770                                               |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0200-0000-0004a7808086                                       |
|           | PCI BDF Address: 0000:00:02.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 1         | Device Name: Intel(R) Arc(TM) A750 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0003-0000-000856a18086                                       |
|           | PCI BDF Address: 0000:03:00.0                                                        |
|           | DRM Device: /dev/dri/card1                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=16M
GPU1 Memory [size=8G
-----------------------------------------------------------------
00:02.0 VGA compatible controller: Intel Corporation Device a780 (rev 04) (prog-if 00 [VGA controller])
        DeviceName: Onboard - Video
        Subsystem: Gigabyte Technology Co., Ltd Device d000
        Flags: bus master, fast devsel, latency 0, IRQ 173
        Memory at 6202000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 4000000000 (64-bit, prefetchable) [size=256M]
        I/O ports at 5000 [size=64]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: <access denied>
--
03:00.0 VGA compatible controller: Intel Corporation Device 56a1 (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Device 1ef7:1395
        Flags: bus master, fast devsel, latency 0, IRQ 174
        Memory at 41000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 6000000000 (64-bit, prefetchable) [size=8G]
        Expansion ROM at 42000000 [disabled] [size=2M]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915
-----------------------------------------------------------------

It seems your upstream performance is much better than what we obtained. By the way, did you set the environment variables below before running llama-bench?

export ONEAPI_DEVICE_SELECTOR="level_zero:0" 
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
ruihe774 commented 2 months ago

I did not set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1. I'm using a very new kernel (6.10) and I found setting it degraded the performance.

With SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 set, the performance of upstream llama.cpp:

ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|    1.3|    448|    1024|   32|  8096M|            1.3.27642|
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         pp512 |    635.04 ± 3.33 |
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         tg128 |     15.83 ± 0.09 |

build: cfac111e (3605)

Without:

ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|    1.3|    448|    1024|   32|  8096M|            1.3.27642|
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         pp512 |    638.34 ± 6.54 |
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         tg128 |     20.62 ± 0.11 |

build: cfac111e (3605)

The performance of llama.cpp in ipex-llm[cpp] is not affected by SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS in my benchmark.
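
The with/without comparison above can be scripted so that the only thing that changes between runs is the one variable. This is a rough sketch under stated assumptions: `make_env` and `run_bench` are hypothetical helpers, and the script assumes only that `llama-bench` accepts `-m <model>` as in the command posted earlier in this thread.

```python
import os
import subprocess

VAR = "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS"


def make_env(immediate: bool, base=None) -> dict:
    """Copy the environment, then set VAR=1 or make sure it is unset."""
    env = dict(os.environ if base is None else base)
    env.pop(VAR, None)  # start from a clean state in both cases
    if immediate:
        env[VAR] = "1"
    return env


def run_bench(bench_path: str, model_path: str, immediate: bool) -> str:
    """Run llama-bench once and return whatever it wrote to stdout."""
    proc = subprocess.run(
        [bench_path, "-m", model_path],
        env=make_env(immediate),
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout


# Example usage (paths are placeholders):
# for flag in (True, False):
#     print(f"--- {VAR} {'set' if flag else 'unset'} ---")
#     print(run_bench("./llama-bench", "gemma2-9b-q4_0.gguf", flag))
```

Removing the variable explicitly in the "without" case matters because a value exported earlier in the shell would otherwise leak into both runs.
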

FWIW 20t/s is still very slow IMO. My computer can run qwen2 7b at 60t/s and llama3.1 8b at 40t/s. I wonder if gemma2 is not optimized yet or it demands more computational effort than other models.

rnwang04 commented 2 months ago

> I did not set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1. I'm using a very new kernel (6.10) and I found setting it degraded the performance.

Thanks for pointing this out! Without this env setting, we get the following results.

upstream

ggml_sycl_init: found 1 SYCL devices:
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|    1.3|    448|    1024|   32|  8096M|            1.3.27191|

| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         pp512 |    677.92 ± 1.45 |
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         tg128 |     22.22 ± 0.03 |

build: cfac111e (3605)

ours

| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|    1.3|    448|    1024|   32|  8096M|            1.3.27191|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:448
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         pp512 |    827.55 ± 1.44 |
| gemma2 9B Q4_0                 |   5.76 GiB |    10.16 B | SYCL       |  99 |         tg128 |     22.32 ± 0.10 |

build: f6b084d (1)

It seems that the performance difference may be related to the kernel.

> FWIW 20t/s is still very slow IMO. My computer can run qwen2 7b at 60t/s and llama3.1 8b at 40t/s. I wonder if gemma2 is not optimized yet or it demands more computational effort than other models.

In fact, both factors are involved: Gemma2-9b is a larger model with more complex computation, and we have not yet optimized it further.

ruihe774 commented 2 months ago

Thank you for pointing this out!

ruihe774 commented 2 months ago

Hi @rnwang04. My guess is that the regression on newer kernels is caused by automatic CCS load balancing being disabled. With the new kernels (6.9 and 6.10), my card exposes only one CCS engine:

$ ls -d /sys/devices/path/to/drm/card1/engine/ccs*
ccs0

On the old kernel (pre-6.8), which is not affected by the regression, my card exposes four:

$ ls -d /sys/devices/path/to/drm/card1/engine/ccs*
ccs0 ccs1 ccs2 ccs3
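
The same check can be scripted to compare machines or kernels. This is a small sketch under stated assumptions: `list_ccs_engines` is a hypothetical helper, and because the real sysfs path is elided in the commands above, the demo builds a stand-in directory instead of guessing it.

```python
import tempfile
from pathlib import Path


def list_ccs_engines(engine_dir) -> list:
    """Return the names of the ccs* entries under a card's engine directory."""
    return sorted(p.name for p in Path(engine_dir).glob("ccs*") if p.is_dir())


# Stand-in for the card's "engine" directory; on a real system, pass the
# actual sysfs path of the card instead of this temporary directory.
with tempfile.TemporaryDirectory() as fake_engine_dir:
    for name in ("rcs0", "bcs0", "ccs0", "ccs1", "ccs2", "ccs3"):
        (Path(fake_engine_dir) / name).mkdir()
    engines = list_ccs_engines(fake_engine_dir)
    print(f"{len(engines)} CCS engine(s): {' '.join(engines)}")
```

Pointed at the real engine directory, a result of one entry would match the new-kernel case described here, and four would match the old-kernel case.
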

I have run my benchmark using the same version of compute-runtime (30049). The only difference is the kernel version.

I have no idea why the upstream llama.cpp is not affected by the regression. Maybe because it is compiled with newer libraries?

I wonder whether you can reproduce the regression on the new kernel and would be interested in fixing it.