intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

Qwen1.5-MoE-A2.7B cannot be offloaded to iGPU #11429

Open · Ce-daros opened this issue 3 months ago

Ce-daros commented 3 months ago

Log output (server.exe run with -ngl 999 to request full GPU offload):

(llm-cpp) D:\Users\Documents\Projects\llama-cpp>server.exe -m "Qwen1.5-MoE-A2.7B-Chat.Q4_K_M.gguf" -ngl 999
{"tid":"12472","timestamp":1719324322,"level":"INFO","function":"main","line":2943,"msg":"build info","build":1,"commit":"adbd0dc"}
{"tid":"12472","timestamp":1719324322,"level":"INFO","function":"main","line":2950,"msg":"system info","n_threads":11,"n_threads_batch":-1,"total_threads":22,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "}
llama_model_loader: loaded meta data with 22 key-value pairs and 411 tensors from Qwen1.5-MoE-A2.7B-Chat.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2moe
llama_model_loader: - kv   1:                               general.name str              = tmpwjxi1ht2
llama_model_loader: - kv   2:                       qwen2moe.block_count u32              = 24
llama_model_loader: - kv   3:                    qwen2moe.context_length u32              = 32768
llama_model_loader: - kv   4:                  qwen2moe.embedding_length u32              = 2048
llama_model_loader: - kv   5:               qwen2moe.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:              qwen2moe.attention.head_count u32              = 16
llama_model_loader: - kv   7:           qwen2moe.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:                    qwen2moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:  qwen2moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                 qwen2moe.expert_used_count u32              = 4
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                      qwen2moe.expert_count u32              = 60
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type  f16:   24 tensors
llama_model_loader: - type q5_0:   12 tensors
llama_model_loader: - type q8_0:   12 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   25 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 60
llm_load_print_meta: n_expert_used    = 4
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = A2.7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 14.32 B
llm_load_print_meta: model size       = 8.83 GiB (5.30 BPW)
llm_load_print_meta: general.name     = tmpwjxi1ht2
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|    1.3|    128|    1024|   32| 13578M|            1.3.27504|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
llm_load_tensors: ggml ctx size =    0.18 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/25 layers to GPU
llm_load_tensors:        CPU buffer size =  9045.01 MiB
..................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  SYCL_Host KV buffer size =    96.00 MiB
llama_new_context_with_model: KV self size  =   96.00 MiB, K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     1.16 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =    50.35 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =   300.75 MiB
llama_new_context_with_model: graph nodes  = 1446
llama_new_context_with_model: graph splits = 436
{"tid":"12472","timestamp":1719324323,"level":"INFO","function":"init","line":715,"msg":"initializing slots","n_slots":1}
{"tid":"12472","timestamp":1719324323,"level":"INFO","function":"init","line":727,"msg":"new slot","id_slot":0,"n_ctx_slot":512}
{"tid":"12472","timestamp":1719324323,"level":"INFO","function":"main","line":3040,"msg":"model loaded"}
{"tid":"12472","timestamp":1719324323,"level":"INFO","function":"main","line":3065,"msg":"chat template","chat_example":"<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n","built_in":true}
{"tid":"12472","timestamp":1719324323,"level":"INFO","function":"main","line":3793,"msg":"HTTP server listening","hostname":"127.0.0.1","port":"8080","n_threads_http":"21"}
{"tid":"12472","timestamp":1719324323,"level":"INFO","function":"update_slots","line":1812,"msg":"all slots are idle"}
Ce-daros commented 3 months ago

By contrast, Qwen1.5-0.5B (a non-MoE model) offloads fully and works very well:

(llm-cpp) D:\Users\Documents\Projects\llama-cpp>server.exe -m "qwen1_5-0_5b-chat-q6_k.gguf" -ngl 999
{"tid":"4428","timestamp":1719324251,"level":"INFO","function":"main","line":2943,"msg":"build info","build":1,"commit":"adbd0dc"}
{"tid":"4428","timestamp":1719324251,"level":"INFO","function":"main","line":2950,"msg":"system info","n_threads":11,"n_threads_batch":-1,"total_threads":22,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "}
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from qwen1_5-0_5b-chat-q6_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = Qwen1.5-0.5B-Chat-AWQ-fp16
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 1024
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 2816
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 16
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  10:                qwen2.use_parallel_residual bool             = true
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  14:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  16:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  18:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - kv  20:                          general.file_type u32              = 18
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q6_K:  170 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 1024
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 2816
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 0.5B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 619.57 M
llm_load_print_meta: model size       = 485.07 MiB (6.57 BPW)
llm_load_print_meta: general.name     = Qwen1.5-0.5B-Chat-AWQ-fp16
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|    1.3|    128|    1024|   32| 13578M|            1.3.27504|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
llm_load_tensors: ggml ctx size =    0.28 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors:      SYCL0 buffer size =   363.36 MiB
llm_load_tensors:        CPU buffer size =   121.71 MiB
...................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =    48.00 MiB
llama_new_context_with_model: KV self size  =   48.00 MiB, K (f16):   24.00 MiB, V (f16):   24.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     1.16 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   298.75 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =     3.01 MiB
llama_new_context_with_model: graph nodes  = 750
llama_new_context_with_model: graph splits = 2
{"tid":"4428","timestamp":1719324259,"level":"INFO","function":"init","line":715,"msg":"initializing slots","n_slots":1}
{"tid":"4428","timestamp":1719324259,"level":"INFO","function":"init","line":727,"msg":"new slot","id_slot":0,"n_ctx_slot":512}
{"tid":"4428","timestamp":1719324259,"level":"INFO","function":"main","line":3040,"msg":"model loaded"}
{"tid":"4428","timestamp":1719324259,"level":"INFO","function":"main","line":3065,"msg":"chat template","chat_example":"<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n","built_in":true}
{"tid":"4428","timestamp":1719324259,"level":"INFO","function":"main","line":3793,"msg":"HTTP server listening","hostname":"127.0.0.1","port":"8080","n_threads_http":"21"}
{"tid":"4428","timestamp":1719324259,"level":"INFO","function":"update_slots","line":1812,"msg":"all slots are idle"}
rnwang04 commented 3 months ago

Hi @Ce-daros, ipex-llm cpp's support for MoE models is still a work in progress. We will let you know here once it is done.
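
In the meantime, a stopgap (assuming CPU-only throughput is acceptable for your use case) is to run the MoE model with offload disabled, so all layers stay on the CPU:

server.exe -m "Qwen1.5-MoE-A2.7B-Chat.Q4_K_M.gguf" -ngl 0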

YixinSong-e commented 3 months ago

Does ipex-llm cpp have plans to be open-sourced?