Closed ruihe774 closed 2 months ago
Hi @ruihe774, we will take a look and see if we can reproduce this issue. Could you please also provide your detailed environment information using our env check script? Also, what script did you use to obtain the results above?
benchmark command:
$ ./llama-bench -m gemma2-9b-q4_0.gguf
The GGUF model file is pulled from https://ollama.com/library/gemma2
env check (I only use ipex-llm[cpp], so ipex-llm[xpu] is not installed):
-----------------------------------------------------------------
PYTHON_VERSION=3.11.9
-----------------------------------------------------------------
transformers=4.44.0
-----------------------------------------------------------------
torch=2.2.0+cpu
-----------------------------------------------------------------
ipex-llm WARNING: Package(s) not found: ipex-llm
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i5-12490F
CPU family: 6
Model: 151
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 2
CPU(s) scaling MHz: 33%
CPU max MHz: 4600.0000
CPU min MHz: 800.0000
-----------------------------------------------------------------
Total CPU Memory: 94.1574 GB
-----------------------------------------------------------------
Operating System:
Ubuntu 24.04 LTS \n \l
-----------------------------------------------------------------
Linux systemd-llm 6.10.4-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Aug 11 15:32:50 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
Version: 1.2.38.20240718
Build ID: 0db09695
Service:
Version: 1.2.38.20240718
Build ID: 0db09695
Level Zero Version: 1.16.0
-----------------------------------------------------------------
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
Driver UUID 32332e34-332e-3032-3736-343200000000
Driver Version 23.43.027642
-----------------------------------------------------------------
Driver related package version:
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information |
+-----------+--------------------------------------------------------------------------------------+
| 0 | Device Name: Intel(R) Arc(TM) A750 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0003-0000-000856a18086 |
| | PCI BDF Address: 0000:03:00.0 |
| | DRM Device: /dev/dri/card1 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
lspci: Unable to load libkmod resources: error -2
GPU0 Memory [size=8G
-----------------------------------------------------------------
lspci: Unable to load libkmod resources: error -2
03:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A750] (rev 08) (prog-if 00 [VGA controller])
Subsystem: Intel Corporation Device 1021
Flags: bus master, fast devsel, latency 0, IRQ 159
Memory at 80000000 (64-bit, non-prefetchable) [size=16M]
Memory at 4000000000 (64-bit, prefetchable) [size=8G]
Expansion ROM at 81000000 [disabled] [size=2M]
Capabilities: <access denied>
Kernel driver in use: i915
-----------------------------------------------------------------
Hi @ruihe774, we can't reproduce this issue on our Linux A750 machine.
$ ./build/bin/llama-bench -m ~/gemma2-9b-it-q4_0.gguf
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A750 Graphics| 1.3| 448| 1024| 32| 8096M| 1.3.27191|
| gemma2 9B Q4_0 | 5.76 GiB | 10.16 B | SYCL | 99 | pp512 | 664.75 ± 1.96 |
| gemma2 9B Q4_0 | 5.76 GiB | 10.16 B | SYCL | 99 | tg128 | 16.33 ± 0.07 |
build: cfac111e (3605)
$ ./llama-bench -m ~/gemma2-9b-it-q4_0.gguf
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A750 Graphics| 1.3| 448| 1024| 32| 8096M| 1.3.27191|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:448
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
| gemma2 9B Q4_0 | 5.76 GiB | 10.16 B | SYCL | 99 | pp512 | 1211.28 ± 3.67 |
| gemma2 9B Q4_0 | 5.76 GiB | 10.16 B | SYCL | 99 | tg128 | 17.21 ± 0.10 |
build: f6b084d (1)
and here is the info of our machine:
$ ./env-check.sh
-----------------------------------------------------------------
PYTHON_VERSION=3.11.9
-----------------------------------------------------------------
transformers=4.44.0
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm ./env-check.sh: /home/yons/miniforge3/envs/qiyue-cpp-0820/bin/pip: /home/arda/miniforge3/envs/qiyue-cpp-0820/bin/python3.11: bad interpreter: No such file or directory
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 28
On-line CPU(s) list: 0-27
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-14700K
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 1
Stepping: 1
CPU max MHz: 5600.0000
CPU min MHz: 800.0000
BogoMIPS: 6835.20
-----------------------------------------------------------------
Total CPU Memory: 62.5534 GB
Memory Type: DDR5-A1
-----------------------------------------------------------------
Operating System:
Ubuntu 22.04.3 LTS \n \l
-----------------------------------------------------------------
Linux yons-B760M-AORUS-ELITE-AX 6.5.0-25-generic #25~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Feb 20 16:09:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
Version: 1.2.22.20231126
Build ID: 00000000
Service:
Version: 1.2.22.20231126
Build ID: 00000000
Level Zero Version: 1.14.0
-----------------------------------------------------------------
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
Driver UUID 32332e33-352e-3237-3139-312e34320000
Driver Version 23.35.27191.42
Driver UUID 32332e33-352e-3237-3139-312e34320000
Driver Version 23.35.27191.42
-----------------------------------------------------------------
Driver related package version:
rc intel-fw-gpu 2023.39.2-255~22.04 all Firmware package for Intel integrated and discrete GPUs
ii intel-level-zero-gpu 1.3.27191.42-775~22.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
igpu detected
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 OpenCL 3.0 NEO [23.35.27191.42]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.27191]
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information |
+-----------+--------------------------------------------------------------------------------------+
| 0 | Device Name: Intel(R) UHD Graphics 770 |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0200-0000-0004a7808086 |
| | PCI BDF Address: 0000:00:02.0 |
| | DRM Device: /dev/dri/card0 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 1 | Device Name: Intel(R) Arc(TM) A750 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0003-0000-000856a18086 |
| | PCI BDF Address: 0000:03:00.0 |
| | DRM Device: /dev/dri/card1 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=16M
GPU1 Memory [size=8G
-----------------------------------------------------------------
00:02.0 VGA compatible controller: Intel Corporation Device a780 (rev 04) (prog-if 00 [VGA controller])
DeviceName: Onboard - Video
Subsystem: Gigabyte Technology Co., Ltd Device d000
Flags: bus master, fast devsel, latency 0, IRQ 173
Memory at 6202000000 (64-bit, non-prefetchable) [size=16M]
Memory at 4000000000 (64-bit, prefetchable) [size=256M]
I/O ports at 5000 [size=64]
Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
Capabilities: <access denied>
--
03:00.0 VGA compatible controller: Intel Corporation Device 56a1 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Device 1ef7:1395
Flags: bus master, fast devsel, latency 0, IRQ 174
Memory at 41000000 (64-bit, non-prefetchable) [size=16M]
Memory at 6000000000 (64-bit, prefetchable) [size=8G]
Expansion ROM at 42000000 [disabled] [size=2M]
Capabilities: <access denied>
Kernel driver in use: i915
Kernel modules: i915
-----------------------------------------------------------------
It seems your upstream performance is much better than what we obtained. By the way, did you set the env variables below before running llama-bench?
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
I did not set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1. I'm using a very new kernel (6.10) and I found setting it degraded the performance.
With SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 set, the performance of upstream llama.cpp:
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A750 Graphics| 1.3| 448| 1024| 32| 8096M| 1.3.27642|
| gemma2 9B Q4_0 | 5.76 GiB | 10.16 B | SYCL | 99 | pp512 | 635.04 ± 3.33 |
| gemma2 9B Q4_0 | 5.76 GiB | 10.16 B | SYCL | 99 | tg128 | 15.83 ± 0.09 |
build: cfac111e (3605)
Without:
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A750 Graphics| 1.3| 448| 1024| 32| 8096M| 1.3.27642|
| gemma2 9B Q4_0 | 5.76 GiB | 10.16 B | SYCL | 99 | pp512 | 638.34 ± 6.54 |
| gemma2 9B Q4_0 | 5.76 GiB | 10.16 B | SYCL | 99 | tg128 | 20.62 ± 0.11 |
build: cfac111e (3605)
The performance of llama.cpp in ipex-llm[cpp] is not affected by SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS in my benchmark.
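For anyone who wants to repeat the with/without comparison, here is a minimal sketch. The function name `ab_immediate_cmdlists` is hypothetical and not part of llama.cpp or ipex-llm. It only runs whatever command it is given twice, once with the variable set to 1 and once with it unset, and labels each run.

```shell
# Hypothetical helper, not part of llama.cpp or ipex-llm.
# Runs the given command twice: once with the variable set to 1,
# once with it removed from the environment. Each run happens in a
# subshell, so the caller's environment is left untouched.
ab_immediate_cmdlists() {
    _var=SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS
    echo "== $_var=1"
    ( export "$_var=1"; "$@" )
    echo "== $_var unset"
    ( unset "$_var"; "$@" )
}

# Usage, with the benchmark command from this thread:
#   ab_immediate_cmdlists ./llama-bench -m gemma2-9b-q4_0.gguf
```

Both runs use the same binary and model, so in this sketch any difference in the tg128 row should come from the variable alone.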
FWIW 20t/s is still very slow IMO. My computer can run qwen2 7b at 60t/s and llama3.1 9b at 40t/s. I wonder if gemma2 is not optimized yet or it demands more computational effort than other models.
I did not set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1. I'm using a very new kernel (6.10) and I found setting it degraded the performance.
Thanks for pointing this out! Without this env setting, we get:
ggml_sycl_init: found 1 SYCL devices:
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A750 Graphics| 1.3| 448| 1024| 32| 8096M| 1.3.27191|
| gemma2 9B Q4_0 | 5.76 GiB | 10.16 B | SYCL | 99 | pp512 | 677.92 ± 1.45 |
| gemma2 9B Q4_0 | 5.76 GiB | 10.16 B | SYCL | 99 | tg128 | 22.22 ± 0.03 |
build: cfac111e (3605)
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A750 Graphics| 1.3| 448| 1024| 32| 8096M| 1.3.27191|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:448
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
| gemma2 9B Q4_0 | 5.76 GiB | 10.16 B | SYCL | 99 | pp512 | 827.55 ± 1.44 |
| gemma2 9B Q4_0 | 5.76 GiB | 10.16 B | SYCL | 99 | tg128 | 22.32 ± 0.10 |
build: f6b084d (1)
It seems that the performance difference may be related to the kernel.
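To compare runs like the ones above without reading the tables by eye, something like the following sketch may help. Both function names are hypothetical. It assumes the result rows keep the seven-column layout shown in this thread (model, size, params, backend, ngl, test, t/s).

```shell
# Hypothetical helpers, not part of llama.cpp or ipex-llm.

# Read llama-bench output on stdin and print "<test> <t/s>" per result row.
# Header, separator, device-table and log lines are skipped because their
# 7th column is not a test name such as pp512 or tg128.
bench_rows() {
    awk -F'|' '{
        split($7, t, " ")   # 7th column: test name
        split($8, v, " ")   # 8th column: mean and stddev; keep the mean
        if (t[1] ~ /^(pp|tg)[0-9]+$/) print t[1], v[1]
    }'
}

# Print the relative change from $1 to $2 as a percentage, one decimal place.
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}
```

Piping a saved log through `bench_rows` gives one line per test. `pct_change` can then be applied to the two pp512 (or tg128) values from builds cfac111e and f6b084d.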
FWIW 20t/s is still very slow IMO. My computer can run qwen2 7b at 60t/s and llama3.1 9b at 40t/s. I wonder if gemma2 is not optimized yet or it demands more computational effort than other models.
In fact, both factors are involved: Gemma2-9b is a larger model with more complex computation, and we have not yet optimized it further.
Thank you for pointing this out!
Hi @rnwang04. My guess is that the regression with a new kernel is caused by disabling automatic CCS load balancing. With the new kernels (6.9 and 6.10), my card has only one CCS engine:
$ ls -d /sys/devices/path/to/drm/card1/engine/ccs*
ccs0
On the old kernel (pre-6.8), which is not affected by the regression, my card has four:
$ ls -d /sys/devices/path/to/drm/card1/engine/ccs*
ccs0 ccs1 ccs2 ccs3
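The check above can be wrapped in a small helper to compare kernels quickly. This is only a sketch: `count_ccs_engines` is a hypothetical name, and the engine directory is passed in as an argument because the real sysfs path differs per machine.

```shell
# Hypothetical helper: count the ccs* entries in a DRM card's engine directory.
# $1: the engine directory (on my machine, the .../drm/card1/engine path above).
count_ccs_engines() {
    _n=0
    for _e in "$1"/ccs*; do
        # An unmatched glob stays literal, so only count entries that exist.
        if [ -e "$_e" ]; then
            _n=$((_n + 1))
        fi
    done
    echo "$_n"
}

# Usage:
#   count_ccs_engines /sys/devices/path/to/drm/card1/engine
```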
I have run my benchmark using the same version of compute-runtime (30049). The only difference is the kernel version.
I have no idea why the upstream llama.cpp is not affected by the regression. Maybe because it is compiled with newer libraries?
I wonder if you could reproduce the regression on the new kernel and would be interested in fixing it.
Hi. I found the token generation speed of gemma2 in llama.cpp in ipex-llm[cpp] is slower than upstream llama.cpp. Can it be optimized?
ipex-llm[cpp]:
upstream: