gpustack / llama-box

LM inference server implementation based on llama.cpp.
MIT License

How do I load a GGUF model that is split into multiple files? #7

Closed luckfu closed 1 week ago

luckfu commented 1 week ago

If my model is split across multiple GGUF files, do the command-line arguments support that? For example, qwen2.5-32b-instruct-q5_k_m*.gguf consists of multiple files. Do I have to merge them into a single file?

thxCode commented 1 week ago

Make sure all your splits are in the same directory, then point llama-box at the first shard; it will handle the rest.

llama-box -m xxx-00001-of-00004.gguf
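
For reference, a sketch of the expected layout (illustrative names following the llama.cpp split naming convention); only the first shard is passed with -m, the remaining shards are picked up automatically from the same directory:

xxx-00001-of-00004.gguf   # pass this one with -m
xxx-00002-of-00004.gguf
xxx-00003-of-00004.gguf
xxx-00004-of-00004.gguf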

luckfu commented 1 week ago

I doubt whether it really loaded the model. Here is what I ran:

ASCEND_RT_VISIBLE_DEVICES=0 ./llama-box -c 8192 -np 4 \
--host 0.0.0.0 \
-m /data/models/Qwen2.5-72B-Instruct-GGUF/qwen2.5-72b-instruct-q5_k_m-00001-of-00014.gguf \
--no-warmup
0.00.785.615 I
0.00.785.625 I version: v0.0.79 (8966164)
0.00.785.625 I compiler: cc (Ubuntu 11.4.0-2ubuntu1~20.04) 11.4.0
0.00.785.626 I target: aarch64-linux-gnu
0.00.785.626 I vendor:
0.00.785.627 I - llama.cpp 4a8ccb37 (395)
0.00.785.628 I - stable-diffusion.cpp ba589f6 (184)
0.00.788.154 I system_info: n_threads = 192 (n_threads_batch = 192) / 192 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
0.00.788.162 I
0.00.788.244 I srv                       main: listening, hostname = 0.0.0.0, port = 8080, n_threads = 6 + 2
0.00.789.479 I srv                       main: loading model
0.00.789.800 I llama_load_model_from_file: using device CANN0 (Ascend910B2) - 62130 MiB free
0.00.855.149 I llama_model_loader: additional 13 GGUFs metadata loaded.
0.00.855.159 I llama_model_loader: loaded meta data with 29 key-value pairs and 963 tensors from /data/models/Qwen2.5-72B-Instruct-GGUF/qwen2.5-72b-instruct-q5_k_m-00001-of-00014.gguf (version GGUF V3 (latest))
0.00.855.200 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.00.855.221 I llama_model_loader: - kv   0:                       general.architecture str              = qwen2
0.00.855.224 I llama_model_loader: - kv   1:                               general.type str              = model
0.00.855.226 I llama_model_loader: - kv   2:                               general.name str              = qwen2.5-72b-instruct
0.00.855.227 I llama_model_loader: - kv   3:                            general.version str              = v0.3
0.00.855.229 I llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-72b-instruct
0.00.855.230 I llama_model_loader: - kv   5:                         general.size_label str              = 73B
0.00.855.238 I llama_model_loader: - kv   6:                          qwen2.block_count u32              = 80
0.00.855.239 I llama_model_loader: - kv   7:                       qwen2.context_length u32              = 32768
0.00.855.243 I llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 8192
0.00.855.245 I llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 29696
0.00.855.246 I llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 64
0.00.855.248 I llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 8
0.00.855.256 I llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
0.00.855.259 I llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
0.00.855.261 I llama_model_loader: - kv  14:                          general.file_type u32              = 17
0.00.855.262 I llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
0.00.855.263 I llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
0.00.887.515 I llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
0.00.894.375 I llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
0.00.925.232 I llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
0.00.925.244 I llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
0.00.925.245 I llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
0.00.925.246 I llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
0.00.925.247 I llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
0.00.925.254 I llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
0.00.925.256 I llama_model_loader: - kv  25:               general.quantization_version u32              = 2
0.00.925.257 I llama_model_loader: - kv  26:                                   split.no u16              = 0
0.00.925.258 I llama_model_loader: - kv  27:                                split.count u16              = 14
0.00.925.260 I llama_model_loader: - kv  28:                        split.tensors.count i32              = 963
0.00.925.261 I llama_model_loader: - type  f32:  401 tensors
0.00.925.262 I llama_model_loader: - type q5_K:  481 tensors
0.00.925.263 I llama_model_loader: - type q6_K:   81 tensors
0.01.198.552 I llm_load_vocab: special tokens cache size = 22
0.01.271.330 I llm_load_vocab: token to piece cache size = 0.9310 MB
0.01.271.350 I llm_load_print_meta: format           = GGUF V3 (latest)
0.01.271.351 I llm_load_print_meta: arch             = qwen2
0.01.271.352 I llm_load_print_meta: vocab type       = BPE
0.01.271.357 I llm_load_print_meta: n_vocab          = 152064
0.01.271.358 I llm_load_print_meta: n_merges         = 151387
0.01.271.359 I llm_load_print_meta: vocab_only       = 0
0.01.271.360 I llm_load_print_meta: n_ctx_train      = 32768
0.01.271.361 I llm_load_print_meta: n_embd           = 8192
0.01.271.362 I llm_load_print_meta: n_layer          = 80
0.01.271.374 I llm_load_print_meta: n_head           = 64
0.01.271.379 I llm_load_print_meta: n_head_kv        = 8
0.01.271.381 I llm_load_print_meta: n_rot            = 128
0.01.271.382 I llm_load_print_meta: n_swa            = 0
0.01.271.383 I llm_load_print_meta: n_embd_head_k    = 128
0.01.271.383 I llm_load_print_meta: n_embd_head_v    = 128
0.01.271.387 I llm_load_print_meta: n_gqa            = 8
0.01.271.390 I llm_load_print_meta: n_embd_k_gqa     = 1024
0.01.271.393 I llm_load_print_meta: n_embd_v_gqa     = 1024
0.01.271.395 I llm_load_print_meta: f_norm_eps       = 0.0e+00
0.01.271.397 I llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
0.01.271.400 I llm_load_print_meta: f_clamp_kqv      = 0.0e+00
0.01.271.401 I llm_load_print_meta: f_max_alibi_bias = 0.0e+00
0.01.271.402 I llm_load_print_meta: f_logit_scale    = 0.0e+00
0.01.271.405 I llm_load_print_meta: n_ff             = 29696
0.01.271.409 I llm_load_print_meta: n_expert         = 0
0.01.271.410 I llm_load_print_meta: n_expert_used    = 0
0.01.271.410 I llm_load_print_meta: causal attn      = 1
0.01.271.411 I llm_load_print_meta: pooling type     = 0
0.01.271.412 I llm_load_print_meta: rope type        = 2
0.01.271.413 I llm_load_print_meta: rope scaling     = linear
0.01.271.416 I llm_load_print_meta: freq_base_train  = 1000000.0
0.01.271.417 I llm_load_print_meta: freq_scale_train = 1
0.01.271.419 I llm_load_print_meta: n_ctx_orig_yarn  = 32768
0.01.271.420 I llm_load_print_meta: rope_finetuned   = unknown
0.01.271.421 I llm_load_print_meta: ssm_d_conv       = 0
0.01.271.426 I llm_load_print_meta: ssm_d_inner      = 0
0.01.271.427 I llm_load_print_meta: ssm_d_state      = 0
0.01.271.427 I llm_load_print_meta: ssm_dt_rank      = 0
0.01.271.428 I llm_load_print_meta: ssm_dt_b_c_rms   = 0
0.01.271.430 I llm_load_print_meta: model type       = 70B
0.01.271.434 I llm_load_print_meta: model ftype      = Q5_K - Medium
0.01.271.436 I llm_load_print_meta: model params     = 72.96 B
0.01.271.438 I llm_load_print_meta: model size       = 48.12 GiB (5.67 BPW)
0.01.271.439 I llm_load_print_meta: general.name     = qwen2.5-72b-instruct
0.01.271.440 I llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
0.01.271.442 I llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
0.01.271.443 I llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
0.01.271.445 I llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
0.01.271.446 I llm_load_print_meta: LF token         = 148848 'ÄĬ'
0.01.271.447 I llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
0.01.271.450 I llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
0.01.271.451 I llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
0.01.271.452 I llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
0.01.271.453 I llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
0.01.271.454 I llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
0.01.271.455 I llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
0.01.271.456 I llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
0.01.271.458 I llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
0.01.271.459 I llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
0.01.271.460 I llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
0.01.271.461 I llm_load_print_meta: max token length = 256
0.04.618.901 I llm_load_tensors: offloading 0 repeating layers to GPU
0.04.618.910 I llm_load_tensors: offloaded 0/81 layers to GPU
0.04.618.926 I llm_load_tensors:   CPU_Mapped model buffer size =  3673.79 MiB
0.04.618.941 I llm_load_tensors:   CPU_Mapped model buffer size =  3655.80 MiB
0.04.618.942 I llm_load_tensors:   CPU_Mapped model buffer size =  3688.92 MiB
0.04.618.942 I llm_load_tensors:   CPU_Mapped model buffer size =  3688.86 MiB
0.04.618.964 I llm_load_tensors:   CPU_Mapped model buffer size =  3688.86 MiB
0.04.618.967 I llm_load_tensors:   CPU_Mapped model buffer size =  3787.93 MiB
0.04.618.967 I llm_load_tensors:   CPU_Mapped model buffer size =  3688.86 MiB
0.04.618.968 I llm_load_tensors:   CPU_Mapped model buffer size =  3788.99 MiB
0.04.618.969 I llm_load_tensors:   CPU_Mapped model buffer size =  3719.73 MiB
0.04.618.970 I llm_load_tensors:   CPU_Mapped model buffer size =  3688.86 MiB
0.04.618.971 I llm_load_tensors:   CPU_Mapped model buffer size =  3688.86 MiB
0.04.618.972 I llm_load_tensors:   CPU_Mapped model buffer size =  3788.96 MiB
0.04.618.973 I llm_load_tensors:   CPU_Mapped model buffer size =  3751.49 MiB
0.04.618.974 I llm_load_tensors:   CPU_Mapped model buffer size =   974.53 MiB
...................................................................................................
0.04.636.749 I llama_new_context_with_model: n_seq_max     = 4
0.04.636.757 I llama_new_context_with_model: n_ctx         = 8192
0.04.636.757 I llama_new_context_with_model: n_ctx_per_seq = 2048
0.04.636.758 I llama_new_context_with_model: n_batch       = 2048
0.04.636.759 I llama_new_context_with_model: n_ubatch      = 512
0.04.636.759 I llama_new_context_with_model: flash_attn    = 0
0.04.636.766 I llama_new_context_with_model: freq_base     = 1000000.0
0.04.636.780 I llama_new_context_with_model: freq_scale    = 1
0.04.636.787 W llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
0.04.988.355 I llama_kv_cache_init:  CANN_Host KV buffer size =  2560.00 MiB
0.04.988.379 I llama_new_context_with_model: KV self size  = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
0.04.988.599 I llama_new_context_with_model:        CPU  output buffer size =     2.32 MiB
0.05.011.668 I llama_new_context_with_model:      CANN0 compute buffer size =    48.03 MiB
0.05.011.681 I llama_new_context_with_model:  CANN_Host compute buffer size =  1060.01 MiB
0.05.011.709 I llama_new_context_with_model: graph nodes  = 2806
0.05.011.724 I llama_new_context_with_model: graph splits = 964 (with bs=512), 1 (with bs=1)
0.05.047.659 I srv                       main: initializing server
0.05.047.938 I srv                       init: initializing slots, n_slots = 4
0.05.048.271 I slot                      init: id  0 | task -1 | new slot n_ctx_slot = 2048
0.05.048.478 I slot                      init: id  1 | task -1 | new slot n_ctx_slot = 2048
0.05.048.483 I slot                      init: id  2 | task -1 | new slot n_ctx_slot = 2048
0.05.048.497 I slot                      init: id  3 | task -1 | new slot n_ctx_slot = 2048
0.05.049.123 I srv                       main: chat template, built_in: 1, chat_example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
0.05.049.245 I srv                       main: starting server

At this point, I make the following call:

curl -X POST 'http://127.0.0.1:8080/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer s" \
-d '{
    "model": "",
    "messages": [
        {"role": "user","content": "你是谁"}
    ],
    "temperature": 0
}'

llama-box logs:

3.19.329.351 I srv  oaicompat_completions_req: params: {"messages":"[...]","model":"","temperature":0}
3.22.433.322 I slot     launch_slot_with_task: id  1 | task 5 | processing task, max_tps = N/A
3.22.447.102 I slot              update_slots: id  1 | task 5 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 10

Then it hangs indefinitely. When I check the NPU status:

[root@npu26 ~]#  npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2                 Version: 24.1.rc2                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B2               | OK            | 93.7        39                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3456 / 65536         |
+===========================+===============+====================================================+
| 1     910B2               | OK            | 92.2        38                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          3347 / 65536         |
+===========================+===============+====================================================+
| 2     910B2               | OK            | 94.7        39                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          3343 / 65536         |
+===========================+===============+====================================================+
| 3     910B2               | OK            | 96.3        38                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3341 / 65536         |
+===========================+===============+====================================================+
| 4     910B2               | OK            | 97.4        43                0    / 0             |
| 0                         | 0000:01:00.0  | 0           0    / 0          3341 / 65536         |
+===========================+===============+====================================================+
| 5     910B2               | OK            | 93.1        43                0    / 0             |
| 0                         | 0000:02:00.0  | 0           0    / 0          3341 / 65536         |
+===========================+===============+====================================================+
| 6     910B2               | OK            | 100.8       43                0    / 0             |
| 0                         | 0000:41:00.0  | 0           0    / 0          3341 / 65536         |
+===========================+===============+====================================================+
| 7     910B2               | OK            | 95.9        42                0    / 0             |
| 0                         | 0000:42:00.0  | 0           0    / 0          3340 / 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| 0       0                 | 4984          | llama-box                | 163                     |
+===========================+===============+====================================================+
| No running processes found in NPU 1                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 3                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 4                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+
thxCode commented 1 week ago

The top logs show that the correct backend (CANN) is being used.


These logs show you haven't specified how many layers to offload to the GPU.

You can use -ngl 99 to offload all layers to the GPU, or use gguf-parser to figure out how many layers should be offloaded for your environment.
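
For example, re-running the earlier command with full offload (a sketch that simply reuses the paths from the log above; adjust to your setup):

ASCEND_RT_VISIBLE_DEVICES=0 ./llama-box -c 8192 -np 4 \
--host 0.0.0.0 \
-ngl 99 \
-m /data/models/Qwen2.5-72B-Instruct-GGUF/qwen2.5-72b-instruct-q5_k_m-00001-of-00014.gguf \
--no-warmup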

luckfu commented 1 week ago
 ASCEND_RT_VISIBLE_DEVICES=0 ./llama-box -c 8192 -np 4 --host 0.0.0.0 -ngl 99  -m ./qwen2.5-32b-instruct-q5_k_m.gguf      
0.00.770.406 I
0.00.770.414 I version: v0.0.79 (8966164)
0.00.770.414 I compiler: cc (Ubuntu 11.4.0-2ubuntu1~20.04) 11.4.0
0.00.770.415 I target: aarch64-linux-gnu
0.00.770.415 I vendor:
0.00.770.416 I - llama.cpp 4a8ccb37 (395)
0.00.770.417 I - stable-diffusion.cpp ba589f6 (184)
0.00.772.821 I system_info: n_threads = 192 (n_threads_batch = 192) / 192 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
0.00.772.825 I
0.00.772.905 I srv                       main: listening, hostname = 0.0.0.0, port = 8080, n_threads = 6 + 2
0.00.774.116 I srv                       main: loading model
0.00.774.416 I llama_load_model_from_file: using device CANN0 (Ascend910B2) - 62131 MiB free
0.00.838.139 I llama_model_loader: loaded meta data with 29 key-value pairs and 771 tensors from ./qwen2.5-32b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
0.00.838.171 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.00.838.189 I llama_model_loader: - kv   0:                       general.architecture str              = qwen2
0.00.838.192 I llama_model_loader: - kv   1:                               general.type str              = model
0.00.838.194 I llama_model_loader: - kv   2:                               general.name str              = qwen2.5-32b-instruct
0.00.838.195 I llama_model_loader: - kv   3:                            general.version str              = v0.1
0.00.838.197 I llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-32b-instruct
0.00.838.198 I llama_model_loader: - kv   5:                         general.size_label str              = 33B
0.00.838.200 I llama_model_loader: - kv   6:                          qwen2.block_count u32              = 64
0.00.838.201 I llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
0.00.838.202 I llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 5120
0.00.838.203 I llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 27648
0.00.838.204 I llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 40
0.00.838.205 I llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 8
0.00.838.218 I llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
0.00.838.221 I llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
0.00.838.221 I llama_model_loader: - kv  14:                          general.file_type u32              = 17
0.00.838.222 I llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
0.00.838.223 I llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
0.00.868.594 I llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
0.00.875.404 I llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
0.00.904.922 I llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
0.00.904.931 I llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
0.00.904.932 I llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
0.00.904.933 I llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
0.00.904.935 I llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
0.00.904.939 I llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
0.00.904.940 I llama_model_loader: - kv  25:               general.quantization_version u32              = 2
0.00.904.941 I llama_model_loader: - kv  26:                                   split.no u16              = 0
0.00.904.942 I llama_model_loader: - kv  27:                                split.count u16              = 0
0.00.904.943 I llama_model_loader: - kv  28:                        split.tensors.count i32              = 771
0.00.904.944 I llama_model_loader: - type  f32:  321 tensors
0.00.904.945 I llama_model_loader: - type q5_K:  385 tensors
0.00.904.946 I llama_model_loader: - type q6_K:   65 tensors
0.01.142.830 I llm_load_vocab: special tokens cache size = 22
0.01.215.395 I llm_load_vocab: token to piece cache size = 0.9310 MB
0.01.215.415 I llm_load_print_meta: format           = GGUF V3 (latest)
0.01.215.415 I llm_load_print_meta: arch             = qwen2
0.01.215.416 I llm_load_print_meta: vocab type       = BPE
0.01.215.419 I llm_load_print_meta: n_vocab          = 152064
0.01.215.420 I llm_load_print_meta: n_merges         = 151387
0.01.215.420 I llm_load_print_meta: vocab_only       = 0
0.01.215.421 I llm_load_print_meta: n_ctx_train      = 131072
0.01.215.421 I llm_load_print_meta: n_embd           = 5120
0.01.215.422 I llm_load_print_meta: n_layer          = 64
0.01.215.437 I llm_load_print_meta: n_head           = 40
0.01.215.440 I llm_load_print_meta: n_head_kv        = 8
0.01.215.442 I llm_load_print_meta: n_rot            = 128
0.01.215.444 I llm_load_print_meta: n_swa            = 0
0.01.215.445 I llm_load_print_meta: n_embd_head_k    = 128
0.01.215.445 I llm_load_print_meta: n_embd_head_v    = 128
0.01.215.447 I llm_load_print_meta: n_gqa            = 5
0.01.215.450 I llm_load_print_meta: n_embd_k_gqa     = 1024
0.01.215.452 I llm_load_print_meta: n_embd_v_gqa     = 1024
0.01.215.453 I llm_load_print_meta: f_norm_eps       = 0.0e+00
0.01.215.455 I llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
0.01.215.456 I llm_load_print_meta: f_clamp_kqv      = 0.0e+00
0.01.215.457 I llm_load_print_meta: f_max_alibi_bias = 0.0e+00
0.01.215.457 I llm_load_print_meta: f_logit_scale    = 0.0e+00
0.01.215.460 I llm_load_print_meta: n_ff             = 27648
0.01.215.460 I llm_load_print_meta: n_expert         = 0
0.01.215.461 I llm_load_print_meta: n_expert_used    = 0
0.01.215.461 I llm_load_print_meta: causal attn      = 1
0.01.215.461 I llm_load_print_meta: pooling type     = 0
0.01.215.462 I llm_load_print_meta: rope type        = 2
0.01.215.463 I llm_load_print_meta: rope scaling     = linear
0.01.215.467 I llm_load_print_meta: freq_base_train  = 1000000.0
0.01.215.468 I llm_load_print_meta: freq_scale_train = 1
0.01.215.468 I llm_load_print_meta: n_ctx_orig_yarn  = 131072
0.01.215.469 I llm_load_print_meta: rope_finetuned   = unknown
0.01.215.469 I llm_load_print_meta: ssm_d_conv       = 0
0.01.215.470 I llm_load_print_meta: ssm_d_inner      = 0
0.01.215.471 I llm_load_print_meta: ssm_d_state      = 0
0.01.215.471 I llm_load_print_meta: ssm_dt_rank      = 0
0.01.215.472 I llm_load_print_meta: ssm_dt_b_c_rms   = 0
0.01.215.473 I llm_load_print_meta: model type       = ?B
0.01.215.475 I llm_load_print_meta: model ftype      = Q5_K - Medium
0.01.215.477 I llm_load_print_meta: model params     = 32.76 B
0.01.215.478 I llm_load_print_meta: model size       = 21.66 GiB (5.68 BPW)
0.01.215.478 I llm_load_print_meta: general.name     = qwen2.5-32b-instruct
0.01.215.479 I llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
0.01.215.480 I llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
0.01.215.480 I llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
0.01.215.481 I llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
0.01.215.482 I llm_load_print_meta: LF token         = 148848 'ÄĬ'
0.01.215.482 I llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
0.01.215.483 I llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
0.01.215.484 I llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
0.01.215.484 I llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
0.01.215.485 I llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
0.01.215.485 I llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
0.01.215.486 I llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
0.01.215.487 I llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
0.01.215.487 I llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
0.01.215.488 I llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
0.01.215.488 I llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
0.01.215.489 I llm_load_print_meta: max token length = 256
0.02.789.838 I llm_load_tensors: offloading 64 repeating layers to GPU
0.02.789.845 I llm_load_tensors: offloading output layer to GPU
0.02.789.846 I llm_load_tensors: offloaded 65/65 layers to GPU
0.02.789.863 I llm_load_tensors:   CPU_Mapped model buffer size = 22178.80 MiB
0.02.789.865 I llm_load_tensors:        CANN0 model buffer size =     4.27 MiB
.................................................................................................
0.02.813.855 I llama_new_context_with_model: n_seq_max     = 4
0.02.813.867 I llama_new_context_with_model: n_ctx         = 8192
0.02.813.868 I llama_new_context_with_model: n_ctx_per_seq = 2048
0.02.813.869 I llama_new_context_with_model: n_batch       = 2048
0.02.813.869 I llama_new_context_with_model: n_ubatch      = 512
0.02.813.871 I llama_new_context_with_model: flash_attn    = 0
0.02.813.903 I llama_new_context_with_model: freq_base     = 1000000.0
0.02.813.909 I llama_new_context_with_model: freq_scale    = 1
0.02.813.914 W llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
0.02.922.429 I llama_kv_cache_init:      CANN0 KV buffer size =  2048.00 MiB
0.02.922.448 I llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
0.02.922.755 I llama_new_context_with_model:  CANN_Host  output buffer size =     2.32 MiB
0.02.936.736 I llama_new_context_with_model:      CANN0 compute buffer size =   686.00 MiB
0.02.936.748 I llama_new_context_with_model:  CANN_Host compute buffer size =   307.00 MiB
0.02.936.749 I llama_new_context_with_model: graph nodes  = 2246
0.02.936.749 I llama_new_context_with_model: graph splits = 643
0.02.936.752 W common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.35.929.924 I srv                       main: initializing server
0.35.929.936 I srv                       init: initializing slots, n_slots = 4
0.35.929.939 I slot                      init: id  0 | task -1 | new slot n_ctx_slot = 2048
0.35.929.953 I slot                      init: id  1 | task -1 | new slot n_ctx_slot = 2048
0.35.930.017 I slot                      init: id  2 | task -1 | new slot n_ctx_slot = 2048
0.35.930.067 I slot                      init: id  3 | task -1 | new slot n_ctx_slot = 2048
0.35.931.327 I srv                       main: chat template, built_in: 1, chat_example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
0.35.931.333 I srv                       main: starting server
1.17.322.386 I srv  oaicompat_completions_req: params: {"messages":"[...]","model":"qwen2","stream":false,"temperature":0}
1.17.325.968 I slot     launch_slot_with_task: id  0 | task 0 | processing task, max_tps = N/A
1.17.325.987 I slot              update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 10

Now I can see:

0.02.789.838 I llm_load_tensors: offloading 64 repeating layers to GPU
0.02.789.845 I llm_load_tensors: offloading output layer to GPU
0.02.789.846 I llm_load_tensors: offloaded 65/65 layers to GPU

But the curl call still hangs.

luckfu commented 1 week ago

I found a documentation link which mentions: for deploying the full Qwen 2.5 model series from ModelScope, CANN operator support is still incomplete; currently only FP16-precision models and Q8_0/Q4_0 quantized models can run, and FP16 models are recommended. So I suspect I may need to re-quantize my model. 🙏

thxCode commented 1 week ago

CANN only supports FP16/Q8/Q4; other quantization types will fall back to CPU computation.
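
If you need to move to one of the supported quantizations, one option is llama.cpp's llama-quantize tool (a hedged sketch, not part of llama-box itself; the file names are illustrative and it assumes you have an FP16 GGUF to quantize from):

# file names are illustrative; Q8_0 is one of the CANN-supported types
./llama-quantize ./qwen2.5-32b-instruct-fp16.gguf ./qwen2.5-32b-instruct-q8_0.gguf Q8_0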