intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0
6.69k stars 1.26k forks

Running 2 x A770 with Ollama, inference responses slow down dramatically #10847

Closed digitalscream closed 6 months ago

digitalscream commented 6 months ago

System specs: 2 x A770, one on PCIE 3.0 x16, one on PCIE 3.0 x4, Ryzen 3600, 64GB RAM (EDIT: Host is Ubuntu 22.04, Docker base image is intelanalytics/ipex-llm-xpu:latest).

When running Ollama in Docker, using the setup described here:

https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html

...it loads large models just fine (e.g. Codellama 34b, with approx. 10GB on each GPU), but response generation starts at around 30t/s and gradually slows to 5t/s within the same response.

At the same time, the GPU frequency drops from 1GHz+ down to 100-300MHz.

The GPUs aren't power-starved, and they're running in open air (so no cooling problems, ambient temp is ~18C).

Worth noting that I'm running with source ipex-llm-init -c -g in the container startup and --no-mmap on llama.cpp (it segfaults without the former, and hangs or outright crashes without the latter).

I know it's not ideal running the second GPU on PCIE x4, but I'd have thought that'd cause general slowness rather than a gradual slowdown.

It shows the exact same behaviour with smaller models, too - e.g. Llama 3 8b and Mistral 7b. With those smaller models, running on a single GPU returns 50-60t/s.
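For completeness, the container is launched roughly as in the quickstart linked above (a minimal sketch, not the exact command - the image tag, mounts and extra flags here are assumptions; the key point is that both GPUs are exposed via --device=/dev/dri):

# Hedged sketch of the container launch; see the ollama_quickstart doc above for the authoritative command.
# --device=/dev/dri exposes both /dev/dri/card0 and /dev/dri/card1 (the two A770s) to the container;
# the model mount path is a placeholder.
docker run -itd \
  --net=host \
  --device=/dev/dri \
  --shm-size=16g \
  -v /path/to/ollama-models:/root/.ollama/models \
  --name=ipex-llm-ollama \
  intelanalytics/ipex-llm-xpu:latest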

digitalscream commented 6 months ago

I've been checking this section of the Ollama logs, just to make sure it's using both GPUs and getting past the 8GB limit mentioned in tickets from a few months back:

[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 6 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       1.3|        512|    1024|     32|    16225243136|
| 1|[level_zero:gpu:1]|               Intel(R) Arc(TM) A770 Graphics|       1.3|        512|    1024|     32|    16225243136|
| 2|    [opencl:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       3.0|        512|    1024|     32|    16225243136|
| 3|    [opencl:gpu:1]|               Intel(R) Arc(TM) A770 Graphics|       3.0|        512|    1024|     32|    16225243136|
| 4|    [opencl:cpu:0]|AMD Ryzen 5 3600 6-Core Processor              |       3.0|         12|    8192|     64|    67349479424|
| 5|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|         12|67108864|     64|    67349479424|
ggml_backend_sycl_set_mul_device_mode: true
detect 2 SYCL GPUs: [0,1] with top Max compute units:512
llm_load_tensors: ggml ctx size =    0.50 MiB
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  9282.81 MiB
llm_load_tensors:      SYCL1 buffer size =  8745.30 MiB
llm_load_tensors:  SYCL_Host buffer size =   140.62 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =   200.00 MiB
llama_kv_cache_init:      SYCL1 KV buffer size =   184.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.15 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   324.00 MiB
llama_new_context_with_model:      SYCL1 compute buffer size =   324.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    20.01 MiB
llama_new_context_with_model: graph nodes  = 1590
llama_new_context_with_model: graph splits = 3

Is there anything else I should be looking at to figure out what's going on, or is that enough info?

digitalscream commented 6 months ago

Here's a quick screenshot of intel_gpu_top on each card, after the first sentence of a response in llama3:8b:

Screenshot from 2024-04-22 15-38-39

Interestingly, it stayed pretty much constant like that to the end of the response, with the blitter on card1 busy at up to 45% and neither card getting up much further than 300MHz.

This is similar behaviour to when I was trying to run vLLM in Docker with old drivers on the host machine. Could this be a driver issue?
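In case it's useful, each card can be watched separately with intel_gpu_top's device filter (a sketch; card0/card1 are the /dev/dri nodes for the two A770s):

# Monitor each Arc card separately (run in two terminals)
sudo intel_gpu_top -d drm:/dev/dri/card0    # card in the x16 slot
sudo intel_gpu_top -d drm:/dev/dri/card1    # card in the x4 slot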

yangw1234 commented 6 months ago

Hi @digitalscream, I did not reproduce this problem. (And I didn't need source ipex-llm-init -c -g or --no-mmap either.) (driver version: 23.17.26241.33)

I think it is possible that this is a driver issue. Would you mind running https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/scripts to check your system environment and replying with the output?
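For reference, a minimal way to run that check (assuming a fresh clone of the repository):

# Clone the repo and run the environment check script referenced above
git clone https://github.com/intel-analytics/ipex-llm.git
cd ipex-llm/python/llm/scripts
bash env-check.sh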

sgwhat commented 6 months ago

Hi @digitalscream,

  1. Could you please provide the detailed performance data returned by the Ollama server, such as the following:
{"function":"print_timings","level":"INFO","line":272,"msg":"prompt eval time     =     197.77 ms /    32 tokens (    6.18 ms per token,   161.80 tokens per second)","n_prompt_tokens_processed":32,"n_tokens_second":161.80329775346232,"slot_id":0,"t_prompt_processing":197.77100000000002,"t_token":6.1803437500000005,"task_id":10,"tid":"140241139329024","timestamp":1713827080}
{"function":"print_timings","level":"INFO","line":286,"msg":"generation eval time =    1998.17 ms /   102 runs   (   19.59 ms per token,    51.05 tokens per second)","n_decoded":102,"n_tokens_second":51.04668219086354,"slot_id":0,"t_token":19.58991176470588,"t_token_generation":1998.171,"task_id":10,"tid":"140241139329024","timestamp":1713827080}
  2. I want to confirm that the speed decrease occurs within the same generation round - am I correct?
digitalscream commented 6 months ago

Hi @digitalscream, I did not reproduce this problem. (And I didn't need source ipex-llm-init -c -g or --no-mmap either.) (driver version: 23.17.26241.33)

I think it is possible that this is a driver issue. Would you mind running https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/scripts to check your system environment and replying with the output?

Run on host:

-----------------------------------------------------------------
-e PYTHON_VERSION=3.12.1
env-check.sh: 15: [[: not found
env-check.sh: 15: [[: not found
-----------------------------------------------------------------
Transformers is not installed. 
-----------------------------------------------------------------
PyTorch is not installed. 
-----------------------------------------------------------------
ipex-llm WARNING: Package(s) not found: ipex-llm
-----------------------------------------------------------------
IPEX is not installed. 
-----------------------------------------------------------------
CPU Information: 
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      43 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             12
On-line CPU(s) list:                0-11
Vendor ID:                          AuthenticAMD
Model name:                         AMD Ryzen 5 3600 6-Core Processor
CPU family:                         23
Model:                              113
Thread(s) per core:                 2
Core(s) per socket:                 6
Socket(s):                          1
Stepping:                           0
Frequency boost:                    enabled
CPU(s) scaling MHz:                 93%
CPU max MHz:                        4208.2031
-----------------------------------------------------------------
Total CPU Memory: 62.7241 GB
-----------------------------------------------------------------
Operating System: 
Ubuntu 23.10 \n \l

-----------------------------------------------------------------
Linux osiris 6.5.0-28-generic #29-Ubuntu SMP PREEMPT_DYNAMIC Thu Mar 28 23:46:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
    Version: 1.2.31.20240308
    Build ID: 00000000

Service:
    Version: 1.2.31.20240308
    Build ID: 00000000
    Level Zero Version: 1.16.0
-----------------------------------------------------------------
  Driver UUID                                     32332e35-322e-3238-3230-322e35310000
  Driver Version                                  23.52.28202.51
  Driver UUID                                     32332e35-322e-3238-3230-322e35310000
  Driver Version                                  23.52.28202.51
  Driver Version                                  2024.17.3.0.08_160000
  Driver Version                                  2024.17.3.0.08_160000
-----------------------------------------------------------------
env-check.sh: 160: sycl-ls: not found
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed. 
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0003-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:03:00.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 1         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-000c-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:0c:00.0                                                        |
|           | DRM Device: /dev/dri/card1                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=16G
GPU1 Memory size=16G
-----------------------------------------------------------------
03:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
    Subsystem: Acer Incorporated [ALI] DG2 [Arc A770]
    Flags: bus master, fast devsel, latency 0, IRQ 63, IOMMU group 17
    Memory at fb000000 (64-bit, non-prefetchable) [size=16M]
    Memory at f800000000 (64-bit, prefetchable) [size=16G]
    Expansion ROM at fc000000 [disabled] [size=2M]
    Capabilities: <access denied>
    Kernel driver in use: i915
    Kernel modules: i915
--
0c:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
    Subsystem: Acer Incorporated [ALI] DG2 [Arc A770]
    Flags: bus master, fast devsel, latency 0, IRQ 66, IOMMU group 23
    Memory at f9000000 (64-bit, non-prefetchable) [size=16M]
    Memory at f000000000 (64-bit, prefetchable) [size=16G]
    Expansion ROM at fa000000 [disabled] [size=2M]
    Capabilities: <access denied>
    Kernel driver in use: i915
    Kernel modules: i915
-----------------------------------------------------------------

Run inside Docker container:

-----------------------------------------------------------------
-e PYTHON_VERSION=3.12.2
env-check.sh: 15: [[: not found
env-check.sh: 15: [[: not found
-----------------------------------------------------------------
/opt/conda/lib/python3.12/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
transformers=4.36.0
-----------------------------------------------------------------
torch=2.2.2+cu121
-----------------------------------------------------------------
ipex-llm Version: 2.1.0b20240419
-----------------------------------------------------------------
IPEX is not installed. 
-----------------------------------------------------------------
CPU Information: 
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      43 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             12
On-line CPU(s) list:                0-11
Vendor ID:                          AuthenticAMD
Model name:                         AMD Ryzen 5 3600 6-Core Processor
CPU family:                         23
Model:                              113
Thread(s) per core:                 2
Core(s) per socket:                 6
Socket(s):                          1
Stepping:                           0
Frequency boost:                    enabled
CPU max MHz:                        4208.2031
CPU min MHz:                        2200.0000
-----------------------------------------------------------------
Total CPU Memory: 62.7241 GB
-----------------------------------------------------------------
Operating System: 
Ubuntu 22.04.3 LTS \n \l

-----------------------------------------------------------------
Linux 4866245ba6bb 6.5.0-28-generic #29-Ubuntu SMP PREEMPT_DYNAMIC Thu Mar 28 23:46:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
    Version: 1.2.13.20230704
    Build ID: 00000000

Service:
    Version: 1.2.13.20230704
    Build ID: 00000000
    Level Zero Version: 1.14.0
-----------------------------------------------------------------
  Driver Version                                  2023.16.12.0.12_195853.xmain-hotfix
  Driver Version                                  2023.16.12.0.12_195853.xmain-hotfix
  Driver UUID                                     32332e33-352e-3237-3139-312e34320000
  Driver Version                                  23.35.27191.42
  Driver UUID                                     32332e33-352e-3237-3139-312e34320000
  Driver Version                                  23.35.27191.42
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed. 
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | UUID: 00000000-0000-0003-0000-000856a08086                                           |
|           | PCI BDF Address: 0000:03:00.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 1         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | UUID: 00000000-0000-000c-0000-000856a08086                                           |
|           | PCI BDF Address: 0000:0c:00.0                                                        |
|           | DRM Device: /dev/dri/card1                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
env-check.sh: 170: lspci: not found
-----------------------------------------------------------------
env-check.sh: 136: lspci: not found
-----------------------------------------------------------------

Notable is the fact that the host driver version is 23.52.28202.51, but the container's version is 23.35.27191.42 (I'm using the image provided under intelanalytics/ipex-llm-xpu:latest).
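A quick way to see that mismatch directly (assuming clinfo is present in both environments; the container name below is a placeholder):

# Compare the compute-runtime driver version on the host vs. inside the container
clinfo | grep "Driver Version"
docker exec -it <container-name> clinfo | grep "Driver Version"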

digitalscream commented 6 months ago

Hi @digitalscream,

  1. Could you please provide the detailed performance data returned by the Ollama server, such as the following:
{"function":"print_timings","level":"INFO","line":272,"msg":"prompt eval time     =     197.77 ms /    32 tokens (    6.18 ms per token,   161.80 tokens per second)","n_prompt_tokens_processed":32,"n_tokens_second":161.80329775346232,"slot_id":0,"t_prompt_processing":197.77100000000002,"t_token":6.1803437500000005,"task_id":10,"tid":"140241139329024","timestamp":1713827080}
{"function":"print_timings","level":"INFO","line":286,"msg":"generation eval time =    1998.17 ms /   102 runs   (   19.59 ms per token,    51.05 tokens per second)","n_decoded":102,"n_tokens_second":51.04668219086354,"slot_id":0,"t_token":19.58991176470588,"t_token_generation":1998.171,"task_id":10,"tid":"140241139329024","timestamp":1713827080}
  2. I want to confirm that the speed decrease occurs within the same generation round - am I correct?

1 - Here you go:

{"function":"update_slots","level":"INFO","line":1574,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"134899419920576","timestamp":1713861125}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":1,"tid":"134899419920576","timestamp":1713861125}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":0,"tid":"134899419920576","timestamp":1713861125}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":2,"tid":"134899419920576","timestamp":1713861125}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":59706,"status":200,"tid":"134891193443904","timestamp":1713861125}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":4,"tid":"134899419920576","timestamp":1713861125}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":59714,"status":200,"tid":"134891185051200","timestamp":1713861125}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":3,"tid":"134899419920576","timestamp":1713861125}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":59740,"status":200,"tid":"134891168265792","timestamp":1713861125}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":5,"tid":"134899419920576","timestamp":1713861125}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":59704,"status":200,"tid":"134891201836608","timestamp":1713861125}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":59692,"status":200,"tid":"134891210229312","timestamp":1713861125}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":59728,"status":200,"tid":"134891176658496","timestamp":1713861125}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":6,"tid":"134899419920576","timestamp":1713861125}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":53470,"status":200,"tid":"134891159873088","timestamp":1713861125}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":7,"tid":"134899419920576","timestamp":1713861125}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":53470,"status":200,"tid":"134891159873088","timestamp":1713861125}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":53470,"status":200,"tid":"134891159873088","timestamp":1713861125}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":8,"tid":"134899419920576","timestamp":1713861125}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":53470,"status":200,"tid":"134891159873088","timestamp":1713861125}
{"function":"launch_slot_with_data","level":"INFO","line":829,"msg":"slot is processing task","slot_id":0,"task_id":9,"tid":"134899419920576","timestamp":1713861126}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1812,"msg":"slot progression","n_past":0,"n_past_se":0,"n_prompt_tokens_processed":20,"slot_id":0,"task_id":9,"tid":"134899419920576","timestamp":1713861126}
{"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":9,"tid":"134899419920576","timestamp":1713861126}
{"function":"print_timings","level":"INFO","line":272,"msg":"prompt eval time     =     195.22 ms /    20 tokens (    9.76 ms per token,   102.45 tokens per second)","n_prompt_tokens_processed":20,"n_tokens_second":102.44747005972687,"slot_id":0,"t_prompt_processing":195.222,"t_token":9.7611,"task_id":9,"tid":"134899419920576","timestamp":1713861214}
{"function":"print_timings","level":"INFO","line":286,"msg":"generation eval time =   87912.34 ms /   760 runs   (  115.67 ms per token,     8.64 tokens per second)","n_decoded":760,"n_tokens_second":8.644975013860966,"slot_id":0,"t_token":115.67413421052632,"t_token_generation":87912.342,"task_id":9,"tid":"134899419920576","timestamp":1713861214}
{"function":"print_timings","level":"INFO","line":295,"msg":"          total time =   88107.56 ms","slot_id":0,"t_prompt_processing":195.222,"t_token_generation":87912.342,"t_total":88107.564,"task_id":9,"tid":"134899419920576","timestamp":1713861214}
{"function":"update_slots","level":"INFO","line":1644,"msg":"slot released","n_cache_tokens":780,"n_ctx":2048,"n_past":779,"n_system_tokens":0,"slot_id":0,"task_id":9,"tid":"134899419920576","timestamp":1713861214,"truncated":false}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/completion","remote_addr":"127.0.0.1","remote_port":53470,"status":200,"tid":"134891159873088","timestamp":1713861214}
[GIN] 2024/04/23 - 16:33:34 | 200 |         1m51s |      172.17.0.1 | POST     "/api/chat"
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":772,"tid":"134899419920576","timestamp":1713861214}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41774,"status":200,"tid":"134891151480384","timestamp":1713861214}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":773,"tid":"134899419920576","timestamp":1713861214}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41774,"status":200,"tid":"134891151480384","timestamp":1713861214}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":41774,"status":200,"tid":"134891151480384","timestamp":1713861214}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":774,"tid":"134899419920576","timestamp":1713861214}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41774,"status":200,"tid":"134891151480384","timestamp":1713861214}
{"function":"launch_slot_with_data","level":"INFO","line":829,"msg":"slot is processing task","slot_id":0,"task_id":775,"tid":"134899419920576","timestamp":1713861214}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1812,"msg":"slot progression","n_past":5,"n_past_se":0,"n_prompt_tokens_processed":55,"slot_id":0,"task_id":775,"tid":"134899419920576","timestamp":1713861214}
{"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":5,"slot_id":0,"task_id":775,"tid":"134899419920576","timestamp":1713861214}
{"function":"print_timings","level":"INFO","line":272,"msg":"prompt eval time     =    1120.91 ms /    55 tokens (   20.38 ms per token,    49.07 tokens per second)","n_prompt_tokens_processed":55,"n_tokens_second":49.06714437248921,"slot_id":0,"t_prompt_processing":1120.913,"t_token":20.380236363636364,"task_id":775,"tid":"134899419920576","timestamp":1713861215}
{"function":"print_timings","level":"INFO","line":286,"msg":"generation eval time =     404.89 ms /     9 runs   (   44.99 ms per token,    22.23 tokens per second)","n_decoded":9,"n_tokens_second":22.228259527279015,"slot_id":0,"t_token":44.98777777777778,"t_token_generation":404.89,"task_id":775,"tid":"134899419920576","timestamp":1713861215}
{"function":"print_timings","level":"INFO","line":295,"msg":"          total time =    1525.80 ms","slot_id":0,"t_prompt_processing":1120.913,"t_token_generation":404.89,"t_total":1525.8029999999999,"task_id":775,"tid":"134899419920576","timestamp":1713861215}
{"function":"update_slots","level":"INFO","line":1644,"msg":"slot released","n_cache_tokens":69,"n_ctx":2048,"n_past":68,"n_system_tokens":0,"slot_id":0,"task_id":775,"tid":"134899419920576","timestamp":1713861215,"truncated":false}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/completion","remote_addr":"127.0.0.1","remote_port":41774,"status":200,"tid":"134891151480384","timestamp":1713861215}
[GIN] 2024/04/23 - 16:33:35 | 200 |   1.66282881s |      172.17.0.1 | POST     "/v1/chat/completions"

You'll note that the second request, to /v1/chat/completions, runs at 22t/s, whereas the first ("Tell me a story in 1000 words") is 8t/s. As described originally, it started off fast and then slowed right down to a crawl. If I change the question to "Give me a random sentence of 40 words", I get this:

{"function":"print_timings","level":"INFO","line":272,"msg":"prompt eval time     =    2312.65 ms /   806 tokens (    2.87 ms per token,   348.52 tokens per second)","n_prompt_tokens_processed":806,"n_tokens_second":348.51853677562406,"slot_id":0,"t_prompt_processing":2312.646,"t_token":2.8692878411910674,"task_id":791,"tid":"134899419920576","timestamp":1713861508}
{"function":"print_timings","level":"INFO","line":286,"msg":"generation eval time =    2637.05 ms /    59 runs   (   44.70 ms per token,    22.37 tokens per second)","n_decoded":59,"n_tokens_second":22.373451586505244,"slot_id":0,"t_token":44.69583050847458,"t_token_generation":2637.054,"task_id":791,"tid":"134899419920576","timestamp":1713861508}

2 - Yes, the slowdown occurs within a single generation round. Initial speed is (guessing) ~35t/s, then it quickly slows down after the first 20 or so tokens.

EDIT: For comparison, this is the same query running on the same system, with a single A770:

{"function":"print_timings","level":"INFO","line":272,"msg":"prompt eval time     =     194.61 ms /    20 tokens (    9.73 ms per token,   102.77 tokens per second)","n_prompt_tokens_processed":20,"n_tokens_second":102.7691137705474,"slot_id":0,"t_prompt_processing":194.611,"t_token":9.73055,"task_id":9,"tid":"136227881986240","timestamp":1713868485}
{"function":"print_timings","level":"INFO","line":286,"msg":"generation eval time =   18152.19 ms /   703 runs   (   25.82 ms per token,    38.73 tokens per second)","n_decoded":703,"n_tokens_second":38.72809872304699,"slot_id":0,"t_token":25.821045519203413,"t_token_generation":18152.195,"task_id":9,"tid":"136227881986240","timestamp":1713868485}
....
{"function":"print_timings","level":"INFO","line":272,"msg":"prompt eval time     =     415.31 ms /    55 tokens (    7.55 ms per token,   132.43 tokens per second)","n_prompt_tokens_processed":55,"n_tokens_second":132.4321526003655,"slot_id":0,"t_prompt_processing":415.307,"t_token":7.5510363636363635,"task_id":718,"tid":"136227881986240","timestamp":1713868486}
{"function":"print_timings","level":"INFO","line":286,"msg":"generation eval time =     227.38 ms /    10 runs   (   22.74 ms per token,    43.98 tokens per second)","n_decoded":10,"n_tokens_second":43.979628636015796,"slot_id":0,"t_token":22.7378,"t_token_generation":227.37800000000001,"task_id":718,"tid":"136227881986240","timestamp":1713868486}
sgwhat commented 6 months ago

Hi @digitalscream ,

We have tested the performance of llama3-8b on our dual Arc A770 machine and found that it is somewhat slower than a single A770 card, which is currently expected. However, we could not replicate the large performance gap that you experienced (115.67 ms per token on dual cards vs. 25.82 ms per token on a single card).

Based on our investigation, this might be due to your GPU driver not being properly installed - for example, the versions of the XPU system management interface (xpu-smi) and Level Zero in your Docker image do not match those on the host:

CLI:
    Version: 1.2.13.20230704
    Build ID: 00000000

Service:
    Version: 1.2.13.20230704
    Build ID: 00000000
    Level Zero Version: 1.14.0

So please reinstall following our guide https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#for-linux-kernel-6-5.

digitalscream commented 6 months ago

@sgwhat - fair enough, I'll give that a go. I must admit, I installed the latest drivers on the host and then just used the provided intelanalytics/ipex-llm-xpu:latest image on the assumption that the drivers in the image would be up to date and properly installed.

Do you know how often that image gets updated? I'm happy to build it from a Ubuntu base image, of course, but it'd be much nicer if the provided image was already there :)

sgwhat commented 6 months ago

Hi @digitalscream, the intelanalytics/ipex-llm-xpu:latest image is updated every day. Please make sure that the GPU driver on your host has been installed properly by following our document. Moreover, you may pull the latest image and run Ollama inside it directly, without installing a GPU driver.

digitalscream commented 6 months ago

OK, I've updated the host drivers - both are showing the same updated version numbers now. However, the same behaviour remains (it's slightly faster, at ~10t/s average over the course of the "Tell me a story in 1000 words" prompt).

I've tried running it with both the intelanalytics/ipex-llm-xpu:latest and a raw Ubuntu 23.10 image with the drivers installed, and it's the exact same.

This is the startup script:

source /opt/intel/oneapi/setvars.sh

source ipex-llm-init -c -g

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export ZES_ENABLE_SYSMAN=1

no_proxy=localhost,0.0.0.0

# Extract the value passed to --model-path using awk
model_path=$(echo "$@" | awk -F'--model-path ' '{print $2}' | awk '{print $1}')

# Print args
echo "Model path: $model_path"
echo "Worker args: $@"

cd /llm/bin

init-llama-cpp
init-ollama

OLLAMA_DEFAULT_KEEPALIVE="-1m" OLLAMA_HOST="0.0.0.0:11434" ./ollama serve &

sleep 2

./ollama create llama_local -f llama3.model

/bin/bash

Am I doing anything obviously wrong there (aside from the unused variables, I'm just experimenting to get it working)?

sgwhat commented 6 months ago

Your startup script is fine. Could you please double-check the image's system environment by running xpu-smi -v and clinfo | grep Driver?

digitalscream commented 6 months ago

Your startup script is fine. Could you please double-check the image's system environment by running xpu-smi -v and clinfo | grep Driver?

Yup, here it is for host and container respectively:

(base) lj@osiris:~$ xpu-smi -v && clinfo | grep Driver
CLI:
    Version: 1.2.31.20240308
    Build ID: 00000000

Service:
    Version: 1.2.31.20240308
    Build ID: 00000000
    Level Zero Version: 1.16.0
  Driver UUID                                     32332e35-322e-3238-3230-322e35310000
  Driver Version                                  23.52.28202.51
  Driver UUID                                     32332e35-322e-3238-3230-322e35310000
  Driver Version                                  23.52.28202.51
(base) lj@osiris:~$ 
root@797c2f6515b1:/llm# xpu-smi -v && clinfo | grep Driver
CLI:
    Version: 1.2.31.20240308
    Build ID: 00000000

Service:
    Version: 1.2.31.20240308
    Build ID: 00000000
    Level Zero Version: 1.16.0
  Driver UUID                                     32332e35-322e-3238-3230-322e35310000
  Driver Version                                  23.52.28202.51
  Driver UUID                                     32332e35-322e-3238-3230-322e35310000
  Driver Version                                  23.52.28202.51
sgwhat commented 6 months ago

Hi @digitalscream, you don't need to set up the GPU driver in the Docker image, and xpu-smi -v and clinfo should not work there either. Could you please try running Ollama in the image directly? Also, is it possible for you to test directly on your host?

Note: You only need to install gpu driver on host, please don't install any driver in your container.

digitalscream commented 6 months ago

Hi @digitalscream, you don't need to set up the GPU driver in the Docker image, and xpu-smi -v and clinfo should not work there either. Could you please try running Ollama in the image directly? Also, is it possible for you to test directly on your host?

Note: You only need to install gpu driver on host, please don't install any driver in your container.

I can't really run it on the host system bare at the moment, because of weird Python issues this machine's had for ages (hence running it in Docker).

I've just tried running it without the driver in the container - Ollama fails to run any models because it can't find the GPU SYCL devices (only the CPU and FPGA emulation devices are visible):

[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 2 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|    [opencl:cpu:0]|AMD Ryzen 5 3600 6-Core Processor              |       3.0|         12|    8192|     64|    67349487616|
| 1|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|         12|67108864|     64|    67349487616|
ggml_backend_sycl_set_mul_device_mode: true
llama_model_load: error loading model: DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/root/.ollama/models/blobs/sha256-3f75702e9f27f9b481928f2df8e2c011ae9b27e18821edc48ce5953160a2cb93'
{"function":"load_model","level":"ERR","line":406,"model":"/root/.ollama/models/blobs/sha256-3f75702e9f27f9b481928f2df8e2c011ae9b27e18821edc48ce5953160a2cb93","msg":"unable to load model","tid":"126633827444736","timestamp":1714026707}
time=2024-04-25T06:31:47.687Z level=ERROR source=server.go:285 msg="error starting llama server" server=cpu_avx2 error="llama runner process no longer running: 1 error:failed to load model '/root/.ollama/models/blobs/sha256-3f75702e9f27f9b481928f2df8e2c011ae9b27e18821edc48ce5953160a2cb93'"
time=2024-04-25T06:31:47.687Z level=ERROR source=server.go:293 msg="unable to load any llama server" error="llama runner process no longer running: 1 error:failed to load model '/root/.ollama/models/blobs/sha256-3f75702e9f27f9b481928f2df8e2c011ae9b27e18821edc48ce5953160a2cb93'"

With your dual-A770 machine, are you running it bare or with Docker? Would it help if I gave you the files for my image to see if you can replicate it there?
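For what it's worth, a quick sanity check inside the container (assuming the oneAPI environment is available, as in the startup script above) is to list the SYCL devices directly:

# Inside the container: verify the Level Zero GPU devices are visible
source /opt/intel/oneapi/setvars.sh
sycl-ls    # should list [level_zero:gpu:0] and [level_zero:gpu:1] for the two A770s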

sgwhat commented 6 months ago

Hi @digitalscream, it seems your XPU is not successfully mapped into the container. Have you started the container following our document? (See the XPU Docker document for more details.)

digitalscream commented 6 months ago

I believe it is, yes - remember, this works perfectly under the same Docker images with a single GPU.

I think we might be going down a bit of a blind alley here - that was all with the bare image installed from scratch. I've just gone back to the intelanalytics/ipex-llm-xpu:latest base image (which is presumably configured correctly), and it's still the same. I've tried with all three available memory allocators (mainly out of curiosity), which hasn't made much of a difference in performance (1-2t/s).

Could the answer simply be that the second GPU is running on a PCIE 3.0 x4 interface? Weirdly, it does seem to be the one doing most of the work (even though it's listed as /dev/dri/card1, so I'd expect /dev/dri/card0 to be the primary). I can't find any way of monitoring the PCIE bandwidth in use, because this machine has an AMD CPU (sorry...). The only other machine I have with multiple x8 slots is an old Xeon E5 machine, which doesn't support AVX so the Arc drivers won't install.

If that's a genuine possibility, I'm OK with sending the second card back (I can't use it on my desktop because the Arc drivers have no fan control under Linux).
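One thing that can be checked, if not live bandwidth, is the negotiated PCIe link speed/width of each slot - a sketch using the BDF addresses from the env-check output above:

# Check the negotiated PCIe link speed/width for each card (BDF addresses from the env-check output)
sudo lspci -vv -s 03:00.0 | grep LnkSta    # card0, x16 slot
sudo lspci -vv -s 0c:00.0 | grep LnkSta    # card1, x4 slot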

digitalscream commented 6 months ago

OK, some useful information - if I use:

export ONEAPI_DEVICE_SELECTOR=level_zero:0

...to limit Ollama to using either card (0 or 1) in the above startup.sh file, it works absolutely fine - card0 (in the x16 slot) runs around 43-45t/s with Llama3, card1 (in the x4 slot) runs at 40-41t/s. If nothing else, it's verification that the second card is definitely functioning correctly and not limited by the x4 slot (the data loads slightly slower into VRAM, but not by much).

I tried forcing it to use OpenCL using the above environment variable, but Ollama fails to load the model. I suspect that's an Ollama problem more than anything.
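For reference, this is roughly how each card was isolated (the second part, running an instance on each GPU at once on different ports, is an assumption on top of the same selector):

# Pin Ollama to a single GPU via the Level Zero device selector
export ONEAPI_DEVICE_SELECTOR=level_zero:0            # card0 (x16 slot); use level_zero:1 for card1
OLLAMA_HOST="0.0.0.0:11434" ./ollama serve

# To run a single-card instance on each GPU simultaneously, use two shells with different ports, e.g.:
# ONEAPI_DEVICE_SELECTOR=level_zero:1 OLLAMA_HOST="0.0.0.0:11435" ./ollama serve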

sgwhat commented 6 months ago

hi @digitalscream , could you please try running a single-card task on each of the two GPUs simultaneously? We want to compare their performance to rule out any hardware issues.

For your information, we have tested on our dual-card machine; Ollama could find the SYCL devices in our container without installing anything, and you may see the performance below:

# Dual Arc770 cards
# Meta-Llama-3-8B.Q4_0.gguf
# Question: Tell a 1000 words story

{"function":"print_timings","level":"INFO","line":276,"msg":"prompt eval time     =     238.61 ms /     6 tokens (   39.77 ms per token,    25.15 tokens per second)","n_prompt_tokens_processed":6,"n_tokens_second":25.1454243709453,"slot_id":0,"t_prompt_processing":238.612,"t_token":39.76866666666667,"task_id":8,"tid":"140197933881344","timestamp":1714098227}
{"function":"print_timings","level":"INFO","line":290,"msg":"generation eval time =   24821.89 ms /   786 runs   (   31.58 ms per token,    31.67 tokens per second)","n_decoded":786,"n_tokens_second":31.66560221630209,"slot_id":0,"t_token":31.580008905852416,"t_token_generation":24821.887,"task_id":8,"tid":"140197933881344","timestamp":1714098227}

# Single Arc770 card
{"function":"print_timings","level":"INFO","line":276,"msg":"prompt eval time     =     196.73 ms /     6 tokens (   32.79 ms per token,    30.50 tokens per second)","n_prompt_tokens_processed":6,"n_tokens_second":30.49803287687944,"slot_id":0,"t_prompt_processing":196.734,"t_token":32.789,"task_id":8,"tid":"140409288452096","timestamp":1714098757}
{"function":"print_timings","level":"INFO","line":290,"msg":"generation eval time =   11001.65 ms /   642 runs   (   17.14 ms per token,    58.35 tokens per second)","n_decoded":642,"n_tokens_second":58.354861914399415,"slot_id":0,"t_token":17.136532710280374,"t_token_generation":11001.654,"task_id":8,"tid":"140409288452096","timestamp":1714098757}
digitalscream commented 6 months ago

@sgwhat - results for card0 (x16):

prompt eval time     =     226.89 ms /    19 tokens (   11.94 ms per token,    83.74 tokens per second)","n_prompt_tokens_processed":19,"n_tokens_second":83.74138896112196,"slot_id":0,"t_prompt_processing":226.889,"t_token":11.941526315789474,"task_id":10,"tid":"131955477649408","timestamp":1714111964}
generation eval time =   13915.46 ms /   532 runs   (   26.16 ms per token,    38.23 tokens per second)","n_decoded":532,"n_tokens_second":38.230868019641754,"slot_id":0,"t_token":26.156874060150376,"t_token_generation":13915.457,"task_id":10,"tid":"131955477649408","timestamp":1714111964}

card1 (x4):

prompt eval time     =     222.50 ms /    18 tokens (   12.36 ms per token,    80.90 tokens per second)","n_prompt_tokens_processed":18,"n_tokens_second":80.89996719056886,"slot_id":0,"t_prompt_processing":222.497,"t_token":12.360944444444446,"task_id":10,"tid":"123683911690240","timestamp":1714111965}
generation eval time =   15325.17 ms /   590 runs   (   25.97 ms per token,    38.50 tokens per second)","n_decoded":590,"n_tokens_second":38.49874722466446,"slot_id":0,"t_token":25.97487118644068,"t_token_generation":15325.174,"task_id":10,"tid":"123683911690240","timestamp":1714111965}

Incidentally, it does seem that even when running just one task on one card, it's slower with another GPU present than if there's just one card physically attached. As I said right up at the top, I normally get over 50t/s, but with two GPUs present it's closer to 40 (I have no idea if that's useful information or not).

In any case, running two tasks simultaneously does seem to rule out any immediate hardware issue, although there's still the possibility that the x4 link on the second card is holding it back if the cards need to talk to each other directly. For what it's worth, I do have ReBar and >4G decoding enabled, everything else is just stock settings in the BIOS.

EDIT: I've just removed the second GPU, and it's still showing the ~40t/s performance, so that's a red herring - I've probably caused that myself somewhere along the line.

digitalscream commented 6 months ago

Additional info: I've managed to get a conda environment running on the host machine, and it shows exactly the same behaviour: 37-38t/s for a single card, or starting at that level for the first 20-30 tokens on dual GPUs and then quickly dropping to ~8t/s.

sgwhat commented 6 months ago

Hi @digitalscream, sorry, we can't reproduce your issue. It could possibly be related to your hardware or GPU driver installation. You may refer to our documentation https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#for-linux-kernel-6-5 for more details.

digitalscream commented 6 months ago

@sgwhat - for what it's worth, I've kinda solved the problem. It turns out that you were on the right track with the drivers - I had intel-i915-dkms and intel-fw-gpu installed from an older iteration of the server.
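For anyone hitting the same thing, the cleanup amounted to roughly the following (a sketch; the exact leftover package set on a given system may differ):

# Remove the stale out-of-tree driver packages left over from the older install,
# then update and reinstall the user-space GPU stack per the ipex-llm install_linux_gpu guide
sudo apt purge intel-i915-dkms intel-fw-gpu
sudo apt autoremove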

Removing those and updating means that I now see 30t/s when running llama3 on dual GPUs. Weirdly, though, I also see 30t/s when running on a single card (either specifying a single card with two in the machine, or with just one physically installed) - exactly the same performance no matter how many cards are installed, but it's definitely using both (from xpu-smi stats and intel_gpu_top).

That's the same behaviour I saw when I had a pair of Tesla P100s installed.

I think I need to just wipe this server at some point...it's had far too much installed and removed over the last year or so.

Thanks for your help!

digitalscream commented 6 months ago

Happy to close this one now; I've ended up sending the other card back, because running 20GB+ models (the whole point of the second card) is just too slow to be usable. I'll wait for Battlemage...