intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc

Several GPU models behave erratically compared to CPU execution #12374

Open pepijndevos opened 2 weeks ago

pepijndevos commented 2 weeks ago

Here is a trace from my Intel Arc A770 via Docker:

$ ollama run  deepseek-coder-v2
>>> write fizzbuzz
"""""""""""""""""""""""""""""""

And here is a trace from Arch Linux running on CPU:

$ ollama run  deepseek-coder-v2 
>>> write fizzbuzz
 Certainly! FizzBuzz is a classic programming task, often used in job interviews to test basic understanding of loops and conditionals. The task goes like this:

1. Print numbers from 1 to 100.
2. For multiples of 3, print "Fizz".
3. For multiples of 5, print "Buzz".
4. For multiples of both 3 and 5 (i.e., multiples of 15), print "FizzBuzz".

Here's a simple implementation in Python:

for i in range(1, 101):
    if i % 15 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)

This code will output the numbers from 1 to 100, replacing multiples of 3 with "Fizz", multiples of 5 with "Buzz", and multiples of both 3 and 5 with "FizzBuzz".

For Docker, I'm using https://github.com/mattcurf/ollama-intel-gpu due to #12372.
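
For reference, the essential part of that setup is just passing the Intel GPU render nodes into the container; a rough sketch (image and volume names here are placeholders, not the exact configuration from that repository, which builds its own image and normally starts it via docker compose):

# Rough sketch of exposing an Intel Arc GPU to an Ollama container.
# "ollama-intel-gpu" and "ollama-models" are placeholder names; the key
# part is passing the /dev/dri render nodes through and publishing the
# standard Ollama API port.
docker run -it --rm \
  --device /dev/dri \
  -v ollama-models:/root/.ollama \
  -p 11434:11434 \
  ollama-intel-gpu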

ollama logs:

ollama-intel-gpu  | time=2024-11-10T20:25:23.772Z level=INFO source=server.go:395 msg="starting llama server" cmd="/tmp/ollama3494697786/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --no-mmap --parallel 4 --port 40951"
ollama-intel-gpu  | time=2024-11-10T20:25:23.773Z level=INFO source=sched.go:450 msg="loaded runners" count=1
ollama-intel-gpu  | time=2024-11-10T20:25:23.773Z level=INFO source=server.go:595 msg="waiting for llama runner to start responding"
ollama-intel-gpu  | time=2024-11-10T20:25:23.773Z level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error"
ollama-intel-gpu  | INFO [main] build info | build=1 commit="6cbbf2a" tid="139094668663808" timestamp=1731270323
ollama-intel-gpu  | INFO [main] system info | n_threads=16 n_threads_batch=-1 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139094668663808" timestamp=1731270323 total_threads=32
ollama-intel-gpu  | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="40951" tid="139094668663808" timestamp=1731270323
ollama-intel-gpu  | llama_model_loader: loaded meta data with 38 key-value pairs and 377 tensors from /root/.ollama/models/blobs/sha256-5ff0abeeac1d2dbdd5455c0b49ba3b29a9ce3c1fb181b2eef2e948689d55d046 (version GGUF V3 (latest))
ollama-intel-gpu  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama-intel-gpu  | llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
ollama-intel-gpu  | llama_model_loader: - kv   1:                               general.name str              = DeepSeek-Coder-V2-Lite-Instruct
ollama-intel-gpu  | llama_model_loader: - kv   2:                      deepseek2.block_count u32              = 27
ollama-intel-gpu  | llama_model_loader: - kv   3:                   deepseek2.context_length u32              = 163840
ollama-intel-gpu  | llama_model_loader: - kv   4:                 deepseek2.embedding_length u32              = 2048
ollama-intel-gpu  | llama_model_loader: - kv   5:              deepseek2.feed_forward_length u32              = 10944
ollama-intel-gpu  | llama_model_loader: - kv   6:             deepseek2.attention.head_count u32              = 16
ollama-intel-gpu  | llama_model_loader: - kv   7:          deepseek2.attention.head_count_kv u32              = 16
ollama-intel-gpu  | llama_model_loader: - kv   8:                   deepseek2.rope.freq_base f32              = 10000.000000
ollama-intel-gpu  | llama_model_loader: - kv   9: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
ollama-intel-gpu  | llama_model_loader: - kv  10:                deepseek2.expert_used_count u32              = 6
ollama-intel-gpu  | llama_model_loader: - kv  11:                          general.file_type u32              = 2
ollama-intel-gpu  | llama_model_loader: - kv  12:        deepseek2.leading_dense_block_count u32              = 1
ollama-intel-gpu  | llama_model_loader: - kv  13:                       deepseek2.vocab_size u32              = 102400
ollama-intel-gpu  | llama_model_loader: - kv  14:           deepseek2.attention.kv_lora_rank u32              = 512
ollama-intel-gpu  | llama_model_loader: - kv  15:             deepseek2.attention.key_length u32              = 192
ollama-intel-gpu  | llama_model_loader: - kv  16:           deepseek2.attention.value_length u32              = 128
ollama-intel-gpu  | llama_model_loader: - kv  17:       deepseek2.expert_feed_forward_length u32              = 1408
ollama-intel-gpu  | llama_model_loader: - kv  18:                     deepseek2.expert_count u32              = 64
ollama-intel-gpu  | llama_model_loader: - kv  19:              deepseek2.expert_shared_count u32              = 2
ollama-intel-gpu  | llama_model_loader: - kv  20:             deepseek2.expert_weights_scale f32              = 1.000000
ollama-intel-gpu  | llama_model_loader: - kv  21:             deepseek2.rope.dimension_count u32              = 64
ollama-intel-gpu  | llama_model_loader: - kv  22:                deepseek2.rope.scaling.type str              = yarn
ollama-intel-gpu  | llama_model_loader: - kv  23:              deepseek2.rope.scaling.factor f32              = 40.000000
ollama-intel-gpu  | llama_model_loader: - kv  24: deepseek2.rope.scaling.original_context_length u32              = 4096
ollama-intel-gpu  | llama_model_loader: - kv  25: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
ollama-intel-gpu  | llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
ollama-intel-gpu  | llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = deepseek-llm
ollama-intel-gpu  | llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
ollama-intel-gpu  | llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama-intel-gpu  | llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
ollama-intel-gpu  | llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 100000
ollama-intel-gpu  | llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 100001
ollama-intel-gpu  | llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 100001
ollama-intel-gpu  | llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
ollama-intel-gpu  | llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
ollama-intel-gpu  | llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
ollama-intel-gpu  | llama_model_loader: - kv  37:               general.quantization_version u32              = 2
ollama-intel-gpu  | llama_model_loader: - type  f32:  108 tensors
ollama-intel-gpu  | llama_model_loader: - type q4_0:  268 tensors
ollama-intel-gpu  | llama_model_loader: - type q6_K:    1 tensors
ollama-intel-gpu  | llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
ollama-intel-gpu  | llm_load_vocab: special tokens cache size = 2400
ollama-intel-gpu  | llm_load_vocab: token to piece cache size = 0.6661 MB
ollama-intel-gpu  | llm_load_print_meta: format           = GGUF V3 (latest)
ollama-intel-gpu  | llm_load_print_meta: arch             = deepseek2
ollama-intel-gpu  | llm_load_print_meta: vocab type       = BPE
ollama-intel-gpu  | llm_load_print_meta: n_vocab          = 102400
ollama-intel-gpu  | llm_load_print_meta: n_merges         = 99757
ollama-intel-gpu  | llm_load_print_meta: vocab_only       = 0
ollama-intel-gpu  | llm_load_print_meta: n_ctx_train      = 163840
ollama-intel-gpu  | llm_load_print_meta: n_embd           = 2048
ollama-intel-gpu  | llm_load_print_meta: n_layer          = 27
ollama-intel-gpu  | llm_load_print_meta: n_head           = 16
ollama-intel-gpu  | llm_load_print_meta: n_head_kv        = 16
ollama-intel-gpu  | llm_load_print_meta: n_rot            = 64
ollama-intel-gpu  | llm_load_print_meta: n_swa            = 0
ollama-intel-gpu  | llm_load_print_meta: n_embd_head_k    = 192
ollama-intel-gpu  | llm_load_print_meta: n_embd_head_v    = 128
ollama-intel-gpu  | llm_load_print_meta: n_gqa            = 1
ollama-intel-gpu  | llm_load_print_meta: n_embd_k_gqa     = 3072
ollama-intel-gpu  | llm_load_print_meta: n_embd_v_gqa     = 2048
ollama-intel-gpu  | llm_load_print_meta: f_norm_eps       = 0.0e+00
ollama-intel-gpu  | llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
ollama-intel-gpu  | llm_load_print_meta: f_clamp_kqv      = 0.0e+00
ollama-intel-gpu  | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama-intel-gpu  | llm_load_print_meta: f_logit_scale    = 0.0e+00
ollama-intel-gpu  | llm_load_print_meta: n_ff             = 10944
ollama-intel-gpu  | llm_load_print_meta: n_expert         = 64
ollama-intel-gpu  | llm_load_print_meta: n_expert_used    = 6
ollama-intel-gpu  | llm_load_print_meta: causal attn      = 1
ollama-intel-gpu  | llm_load_print_meta: pooling type     = 0
ollama-intel-gpu  | llm_load_print_meta: rope type        = 0
ollama-intel-gpu  | llm_load_print_meta: rope scaling     = yarn
ollama-intel-gpu  | llm_load_print_meta: freq_base_train  = 10000.0
ollama-intel-gpu  | llm_load_print_meta: freq_scale_train = 0.025
ollama-intel-gpu  | llm_load_print_meta: n_ctx_orig_yarn  = 4096
ollama-intel-gpu  | llm_load_print_meta: rope_finetuned   = unknown
ollama-intel-gpu  | llm_load_print_meta: ssm_d_conv       = 0
ollama-intel-gpu  | llm_load_print_meta: ssm_d_inner      = 0
ollama-intel-gpu  | llm_load_print_meta: ssm_d_state      = 0
ollama-intel-gpu  | llm_load_print_meta: ssm_dt_rank      = 0
ollama-intel-gpu  | llm_load_print_meta: ssm_dt_b_c_rms   = 0
ollama-intel-gpu  | llm_load_print_meta: model type       = 16B
ollama-intel-gpu  | llm_load_print_meta: model ftype      = Q4_0
ollama-intel-gpu  | llm_load_print_meta: model params     = 15.71 B
ollama-intel-gpu  | llm_load_print_meta: model size       = 8.29 GiB (4.53 BPW) 
ollama-intel-gpu  | llm_load_print_meta: general.name     = DeepSeek-Coder-V2-Lite-Instruct
ollama-intel-gpu  | llm_load_print_meta: BOS token        = 100000 '<|begin▁of▁sentence|>'
ollama-intel-gpu  | llm_load_print_meta: EOS token        = 100001 '<|end▁of▁sentence|>'
ollama-intel-gpu  | llm_load_print_meta: PAD token        = 100001 '<|end▁of▁sentence|>'
ollama-intel-gpu  | llm_load_print_meta: LF token         = 126 'Ä'
ollama-intel-gpu  | llm_load_print_meta: EOG token        = 100001 '<|end▁of▁sentence|>'
ollama-intel-gpu  | llm_load_print_meta: max token length = 256
ollama-intel-gpu  | llm_load_print_meta: n_layer_dense_lead   = 1
ollama-intel-gpu  | llm_load_print_meta: n_lora_q             = 0
ollama-intel-gpu  | llm_load_print_meta: n_lora_kv            = 512
ollama-intel-gpu  | llm_load_print_meta: n_ff_exp             = 1408
ollama-intel-gpu  | llm_load_print_meta: n_expert_shared      = 2
ollama-intel-gpu  | llm_load_print_meta: expert_weights_scale = 1.0
ollama-intel-gpu  | llm_load_print_meta: rope_yarn_log_mul    = 0.0707
ollama-intel-gpu  | time=2024-11-10T20:25:24.024Z level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server loading model"
ollama-intel-gpu  | ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ollama-intel-gpu  | ggml_sycl_init: SYCL_USE_XMX: yes
ollama-intel-gpu  | ggml_sycl_init: found 1 SYCL devices:
ollama-intel-gpu  | llm_load_tensors: ggml ctx size =    0.32 MiB
ollama-intel-gpu  | llm_load_tensors: offloading 27 repeating layers to GPU
ollama-intel-gpu  | llm_load_tensors: offloading non-repeating layers to GPU
ollama-intel-gpu  | llm_load_tensors: offloaded 28/28 layers to GPU
ollama-intel-gpu  | llm_load_tensors:      SYCL0 buffer size =  8376.27 MiB
ollama-intel-gpu  | llm_load_tensors:  SYCL_Host buffer size =   112.50 MiB
ollama-intel-gpu  | llama_new_context_with_model: n_ctx      = 8192
ollama-intel-gpu  | llama_new_context_with_model: n_batch    = 512
ollama-intel-gpu  | llama_new_context_with_model: n_ubatch   = 512
ollama-intel-gpu  | llama_new_context_with_model: flash_attn = 0
ollama-intel-gpu  | llama_new_context_with_model: freq_base  = 10000.0
ollama-intel-gpu  | llama_new_context_with_model: freq_scale = 0.025
ollama-intel-gpu  | [SYCL] call ggml_check_sycl
ollama-intel-gpu  | ggml_check_sycl: GGML_SYCL_DEBUG: 0
ollama-intel-gpu  | ggml_check_sycl: GGML_SYCL_F16: no
ollama-intel-gpu  | found 1 SYCL devices:
ollama-intel-gpu  | |  |                   |                                       |       |Max    |        |Max  |Global |                     |
ollama-intel-gpu  | |  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
ollama-intel-gpu  | |ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
ollama-intel-gpu  | |--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
ollama-intel-gpu  | | 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.6|    512|    1024|   32| 16225M|            1.3.31294|
ollama-intel-gpu  | llama_kv_cache_init:      SYCL0 KV buffer size =  2160.00 MiB
ollama-intel-gpu  | llama_new_context_with_model: KV self size  = 2160.00 MiB, K (f16): 1296.00 MiB, V (f16):  864.00 MiB
ollama-intel-gpu  | llama_new_context_with_model:  SYCL_Host  output buffer size =     1.59 MiB
ollama-intel-gpu  | llama_new_context_with_model:      SYCL0 compute buffer size =   339.13 MiB
ollama-intel-gpu  | llama_new_context_with_model:  SYCL_Host compute buffer size =    38.01 MiB
ollama-intel-gpu  | llama_new_context_with_model: graph nodes  = 1951
ollama-intel-gpu  | llama_new_context_with_model: graph splits = 110
ollama-intel-gpu  | [1731270330] warming up the model with an empty run
ollama-intel-gpu  | INFO [main] model loaded | tid="139094668663808" timestamp=1731270335
ollama-intel-gpu  | time=2024-11-10T20:25:35.563Z level=INFO source=server.go:634 msg="llama runner started in 11.79 seconds"
ollama-intel-gpu  | [GIN] 2024/11/10 - 20:25:35 | 200 |  11.81337377s |       127.0.0.1 | POST     "/api/chat"
ollama-intel-gpu  | [GIN] 2024/11/10 - 20:25:46 | 200 |       22.86µs |       127.0.0.1 | HEAD     "/"
ollama-intel-gpu  | [GIN] 2024/11/10 - 20:25:46 | 200 |    6.807262ms |       127.0.0.1 | POST     "/api/show"
ollama-intel-gpu  | [GIN] 2024/11/10 - 20:25:46 | 200 |    6.526006ms |       127.0.0.1 | POST     "/api/chat"
ollama-intel-gpu  | check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
ollama-intel-gpu  | [GIN] 2024/11/10 - 20:25:59 | 200 |  9.400866991s |       127.0.0.1 | POST     "/api/chat"
sgwhat commented 2 weeks ago

Hi @pepijndevos, we have reproduced your issue and are working on a solution. We will update you as soon as possible.

pepijndevos commented 1 week ago

I ran into similar but less obvious problems where qwen2.5-coder:14b just gets stuck in repeating patterns or suddenly starts talking about something completely different, while running on CPU reliably produces sensible results.

| Q | Output |
| --- | --- |
|   | Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data |
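
To rule out sampling noise when comparing the two backends, the seed and temperature can be pinned through the Ollama HTTP API; the request below is only an illustrative sketch (prompt and seed are arbitrary):

# Compare the GPU and CPU backends under identical, deterministic sampling.
# The prompt and seed are arbitrary examples; send the same request to both
# setups and diff the responses.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:14b",
  "prompt": "write fizzbuzz",
  "stream": false,
  "options": { "seed": 42, "temperature": 0 }
}'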
rynprrk commented 14 hours ago

> I ran into similar but less obvious problems where qwen2.5-coder:14b just gets stuck in repeating patterns or suddenly starts talking about something completely different, while running on CPU reliably produces sensible results.
>
> | Q | Output |
> | --- | --- |
> |   | Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data |

I was able to reproduce the issue. I have a burning suspicion that this has to do with the way memory is being shared. I am running an Arc A750 with the iGPU disabled. Since the card only has 8 GB of GDDR6, I can realistically only load one 8B-parameter model reliably. When loading multiple models (where total memory exceeds 8 GB), I see similar behavior.

My speculation is that something is going wrong when accessing models that share GPU and system memory.
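
If that is the case, forcing Ollama to keep only a single model resident should make the corruption disappear. A rough way to test it, assuming the ipex-llm/Docker build honors the standard Ollama environment variables:

# Keep only one model resident and handle requests serially, so the 8 GB
# card never has to hold more than a single model at a time.
# OLLAMA_MAX_LOADED_MODELS and OLLAMA_NUM_PARALLEL are standard Ollama
# environment variables; whether this build honors them identically is
# an assumption.
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_NUM_PARALLEL=1
ollama serve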

qiuxin2012 commented 8 hours ago

> I ran into similar but less obvious problems where qwen2.5-coder:14b just gets stuck in repeating patterns or suddenly starts talking about something completely different, while running on CPU reliably produces sensible results.
>
> | Q | Output |
> | --- | --- |
> |   | Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data Data |

We may have fixed this two weeks ago. Could you update your ipex-llm and try again?
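
For a pip-based install, the update usually looks roughly like this (following the ipex-llm Ollama quickstart; a Docker setup like the one above would need its image rebuilt instead, and the exact steps depend on its Dockerfile):

# Upgrade the llama.cpp/Ollama backend shipped with ipex-llm to the latest
# pre-release and regenerate the ollama launcher in the current directory.
pip install --pre --upgrade "ipex-llm[cpp]"
init-ollama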