Closed: luckfu closed this 1 week ago
Make sure all your splits are in the same directory, then point box at the first shard; it will handle the rest.
llama-box -m xxx-00001-of-00004.gguf
I doubt it actually loaded the model. Here is what I ran:
ASCEND_RT_VISIBLE_DEVICES=0 ./llama-box -c 8192 -np 4 \
--host 0.0.0.0 \
-m /data/models/Qwen2.5-72B-Instruct-GGUF/qwen2.5-72b-instruct-q5_k_m-00001-of-00014.gguf \
--no-warmup
0.00.785.615 I
0.00.785.625 I version: v0.0.79 (8966164)
0.00.785.625 I compiler: cc (Ubuntu 11.4.0-2ubuntu1~20.04) 11.4.0
0.00.785.626 I target: aarch64-linux-gnu
0.00.785.626 I vendor:
0.00.785.627 I - llama.cpp 4a8ccb37 (395)
0.00.785.628 I - stable-diffusion.cpp ba589f6 (184)
0.00.788.154 I system_info: n_threads = 192 (n_threads_batch = 192) / 192 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
0.00.788.162 I
0.00.788.244 I srv main: listening, hostname = 0.0.0.0, port = 8080, n_threads = 6 + 2
0.00.789.479 I srv main: loading model
0.00.789.800 I llama_load_model_from_file: using device CANN0 (Ascend910B2) - 62130 MiB free
0.00.855.149 I llama_model_loader: additional 13 GGUFs metadata loaded.
0.00.855.159 I llama_model_loader: loaded meta data with 29 key-value pairs and 963 tensors from /data/models/Qwen2.5-72B-Instruct-GGUF/qwen2.5-72b-instruct-q5_k_m-00001-of-00014.gguf (version GGUF V3 (latest))
0.00.855.200 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.00.855.221 I llama_model_loader: - kv 0: general.architecture str = qwen2
0.00.855.224 I llama_model_loader: - kv 1: general.type str = model
0.00.855.226 I llama_model_loader: - kv 2: general.name str = qwen2.5-72b-instruct
0.00.855.227 I llama_model_loader: - kv 3: general.version str = v0.3
0.00.855.229 I llama_model_loader: - kv 4: general.finetune str = qwen2.5-72b-instruct
0.00.855.230 I llama_model_loader: - kv 5: general.size_label str = 73B
0.00.855.238 I llama_model_loader: - kv 6: qwen2.block_count u32 = 80
0.00.855.239 I llama_model_loader: - kv 7: qwen2.context_length u32 = 32768
0.00.855.243 I llama_model_loader: - kv 8: qwen2.embedding_length u32 = 8192
0.00.855.245 I llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 29696
0.00.855.246 I llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 64
0.00.855.248 I llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 8
0.00.855.256 I llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
0.00.855.259 I llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
0.00.855.261 I llama_model_loader: - kv 14: general.file_type u32 = 17
0.00.855.262 I llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
0.00.855.263 I llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
0.00.887.515 I llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
0.00.894.375 I llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
0.00.925.232 I llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
0.00.925.244 I llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
0.00.925.245 I llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
0.00.925.246 I llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
0.00.925.247 I llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
0.00.925.254 I llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
0.00.925.256 I llama_model_loader: - kv 25: general.quantization_version u32 = 2
0.00.925.257 I llama_model_loader: - kv 26: split.no u16 = 0
0.00.925.258 I llama_model_loader: - kv 27: split.count u16 = 14
0.00.925.260 I llama_model_loader: - kv 28: split.tensors.count i32 = 963
0.00.925.261 I llama_model_loader: - type f32: 401 tensors
0.00.925.262 I llama_model_loader: - type q5_K: 481 tensors
0.00.925.263 I llama_model_loader: - type q6_K: 81 tensors
0.01.198.552 I llm_load_vocab: special tokens cache size = 22
0.01.271.330 I llm_load_vocab: token to piece cache size = 0.9310 MB
0.01.271.350 I llm_load_print_meta: format = GGUF V3 (latest)
0.01.271.351 I llm_load_print_meta: arch = qwen2
0.01.271.352 I llm_load_print_meta: vocab type = BPE
0.01.271.357 I llm_load_print_meta: n_vocab = 152064
0.01.271.358 I llm_load_print_meta: n_merges = 151387
0.01.271.359 I llm_load_print_meta: vocab_only = 0
0.01.271.360 I llm_load_print_meta: n_ctx_train = 32768
0.01.271.361 I llm_load_print_meta: n_embd = 8192
0.01.271.362 I llm_load_print_meta: n_layer = 80
0.01.271.374 I llm_load_print_meta: n_head = 64
0.01.271.379 I llm_load_print_meta: n_head_kv = 8
0.01.271.381 I llm_load_print_meta: n_rot = 128
0.01.271.382 I llm_load_print_meta: n_swa = 0
0.01.271.383 I llm_load_print_meta: n_embd_head_k = 128
0.01.271.383 I llm_load_print_meta: n_embd_head_v = 128
0.01.271.387 I llm_load_print_meta: n_gqa = 8
0.01.271.390 I llm_load_print_meta: n_embd_k_gqa = 1024
0.01.271.393 I llm_load_print_meta: n_embd_v_gqa = 1024
0.01.271.395 I llm_load_print_meta: f_norm_eps = 0.0e+00
0.01.271.397 I llm_load_print_meta: f_norm_rms_eps = 1.0e-06
0.01.271.400 I llm_load_print_meta: f_clamp_kqv = 0.0e+00
0.01.271.401 I llm_load_print_meta: f_max_alibi_bias = 0.0e+00
0.01.271.402 I llm_load_print_meta: f_logit_scale = 0.0e+00
0.01.271.405 I llm_load_print_meta: n_ff = 29696
0.01.271.409 I llm_load_print_meta: n_expert = 0
0.01.271.410 I llm_load_print_meta: n_expert_used = 0
0.01.271.410 I llm_load_print_meta: causal attn = 1
0.01.271.411 I llm_load_print_meta: pooling type = 0
0.01.271.412 I llm_load_print_meta: rope type = 2
0.01.271.413 I llm_load_print_meta: rope scaling = linear
0.01.271.416 I llm_load_print_meta: freq_base_train = 1000000.0
0.01.271.417 I llm_load_print_meta: freq_scale_train = 1
0.01.271.419 I llm_load_print_meta: n_ctx_orig_yarn = 32768
0.01.271.420 I llm_load_print_meta: rope_finetuned = unknown
0.01.271.421 I llm_load_print_meta: ssm_d_conv = 0
0.01.271.426 I llm_load_print_meta: ssm_d_inner = 0
0.01.271.427 I llm_load_print_meta: ssm_d_state = 0
0.01.271.427 I llm_load_print_meta: ssm_dt_rank = 0
0.01.271.428 I llm_load_print_meta: ssm_dt_b_c_rms = 0
0.01.271.430 I llm_load_print_meta: model type = 70B
0.01.271.434 I llm_load_print_meta: model ftype = Q5_K - Medium
0.01.271.436 I llm_load_print_meta: model params = 72.96 B
0.01.271.438 I llm_load_print_meta: model size = 48.12 GiB (5.67 BPW)
0.01.271.439 I llm_load_print_meta: general.name = qwen2.5-72b-instruct
0.01.271.440 I llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
0.01.271.442 I llm_load_print_meta: EOS token = 151645 '<|im_end|>'
0.01.271.443 I llm_load_print_meta: EOT token = 151645 '<|im_end|>'
0.01.271.445 I llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
0.01.271.446 I llm_load_print_meta: LF token = 148848 'ÄĬ'
0.01.271.447 I llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
0.01.271.450 I llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
0.01.271.451 I llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
0.01.271.452 I llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
0.01.271.453 I llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
0.01.271.454 I llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
0.01.271.455 I llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
0.01.271.456 I llm_load_print_meta: EOG token = 151645 '<|im_end|>'
0.01.271.458 I llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
0.01.271.459 I llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
0.01.271.460 I llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
0.01.271.461 I llm_load_print_meta: max token length = 256
0.04.618.901 I llm_load_tensors: offloading 0 repeating layers to GPU
0.04.618.910 I llm_load_tensors: offloaded 0/81 layers to GPU
0.04.618.926 I llm_load_tensors: CPU_Mapped model buffer size = 3673.79 MiB
0.04.618.941 I llm_load_tensors: CPU_Mapped model buffer size = 3655.80 MiB
0.04.618.942 I llm_load_tensors: CPU_Mapped model buffer size = 3688.92 MiB
0.04.618.942 I llm_load_tensors: CPU_Mapped model buffer size = 3688.86 MiB
0.04.618.964 I llm_load_tensors: CPU_Mapped model buffer size = 3688.86 MiB
0.04.618.967 I llm_load_tensors: CPU_Mapped model buffer size = 3787.93 MiB
0.04.618.967 I llm_load_tensors: CPU_Mapped model buffer size = 3688.86 MiB
0.04.618.968 I llm_load_tensors: CPU_Mapped model buffer size = 3788.99 MiB
0.04.618.969 I llm_load_tensors: CPU_Mapped model buffer size = 3719.73 MiB
0.04.618.970 I llm_load_tensors: CPU_Mapped model buffer size = 3688.86 MiB
0.04.618.971 I llm_load_tensors: CPU_Mapped model buffer size = 3688.86 MiB
0.04.618.972 I llm_load_tensors: CPU_Mapped model buffer size = 3788.96 MiB
0.04.618.973 I llm_load_tensors: CPU_Mapped model buffer size = 3751.49 MiB
0.04.618.974 I llm_load_tensors: CPU_Mapped model buffer size = 974.53 MiB
...................................................................................................
0.04.636.749 I llama_new_context_with_model: n_seq_max = 4
0.04.636.757 I llama_new_context_with_model: n_ctx = 8192
0.04.636.757 I llama_new_context_with_model: n_ctx_per_seq = 2048
0.04.636.758 I llama_new_context_with_model: n_batch = 2048
0.04.636.759 I llama_new_context_with_model: n_ubatch = 512
0.04.636.759 I llama_new_context_with_model: flash_attn = 0
0.04.636.766 I llama_new_context_with_model: freq_base = 1000000.0
0.04.636.780 I llama_new_context_with_model: freq_scale = 1
0.04.636.787 W llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
0.04.988.355 I llama_kv_cache_init: CANN_Host KV buffer size = 2560.00 MiB
0.04.988.379 I llama_new_context_with_model: KV self size = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
0.04.988.599 I llama_new_context_with_model: CPU output buffer size = 2.32 MiB
0.05.011.668 I llama_new_context_with_model: CANN0 compute buffer size = 48.03 MiB
0.05.011.681 I llama_new_context_with_model: CANN_Host compute buffer size = 1060.01 MiB
0.05.011.709 I llama_new_context_with_model: graph nodes = 2806
0.05.011.724 I llama_new_context_with_model: graph splits = 964 (with bs=512), 1 (with bs=1)
0.05.047.659 I srv main: initializing server
0.05.047.938 I srv init: initializing slots, n_slots = 4
0.05.048.271 I slot init: id 0 | task -1 | new slot n_ctx_slot = 2048
0.05.048.478 I slot init: id 1 | task -1 | new slot n_ctx_slot = 2048
0.05.048.483 I slot init: id 2 | task -1 | new slot n_ctx_slot = 2048
0.05.048.497 I slot init: id 3 | task -1 | new slot n_ctx_slot = 2048
0.05.049.123 I srv main: chat template, built_in: 1, chat_example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
0.05.049.245 I srv main: starting server
At this point, I call:
curl -X POST 'http://127.0.0.1:8080/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer s" \
-d '{
"model": "",
"messages": [
{"role": "user","content": "你是谁"}
],
"temperature": 0
}'
llama-box prints:
3.19.329.351 I srv oaicompat_completions_req: params: {"messages":"[...]","model":"","temperature":0}
3.22.433.322 I slot launch_slot_with_task: id 1 | task 5 | processing task, max_tps = N/A
3.22.447.102 I slot update_slots: id 1 | task 5 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 10
Then it just hangs. When I check the NPU status:
[root@npu26 ~]# npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2 Version: 24.1.rc2 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 0 910B2 | OK | 93.7 39 0 / 0 |
| 0 | 0000:C1:00.0 | 0 0 / 0 3456 / 65536 |
+===========================+===============+====================================================+
| 1 910B2 | OK | 92.2 38 0 / 0 |
| 0 | 0000:C2:00.0 | 0 0 / 0 3347 / 65536 |
+===========================+===============+====================================================+
| 2 910B2 | OK | 94.7 39 0 / 0 |
| 0 | 0000:81:00.0 | 0 0 / 0 3343 / 65536 |
+===========================+===============+====================================================+
| 3 910B2 | OK | 96.3 38 0 / 0 |
| 0 | 0000:82:00.0 | 0 0 / 0 3341 / 65536 |
+===========================+===============+====================================================+
| 4 910B2 | OK | 97.4 43 0 / 0 |
| 0 | 0000:01:00.0 | 0 0 / 0 3341 / 65536 |
+===========================+===============+====================================================+
| 5 910B2 | OK | 93.1 43 0 / 0 |
| 0 | 0000:02:00.0 | 0 0 / 0 3341 / 65536 |
+===========================+===============+====================================================+
| 6 910B2 | OK | 100.8 43 0 / 0 |
| 0 | 0000:41:00.0 | 0 0 / 0 3341 / 65536 |
+===========================+===============+====================================================+
| 7 910B2 | OK | 95.9 42 0 / 0 |
| 0 | 0000:42:00.0 | 0 0 / 0 3340 / 65536 |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| 0 0 | 4984 | llama-box | 163 |
+===========================+===============+====================================================+
| No running processes found in NPU 1 |
+===========================+===============+====================================================+
| No running processes found in NPU 2 |
+===========================+===============+====================================================+
| No running processes found in NPU 3 |
+===========================+===============+====================================================+
| No running processes found in NPU 4 |
+===========================+===============+====================================================+
| No running processes found in NPU 5 |
+===========================+===============+====================================================+
| No running processes found in NPU 6 |
+===========================+===============+====================================================+
| No running processes found in NPU 7 |
+===========================+===============+====================================================+
The top logs show that you are using the correct backend as desired.
However, they also show that you have not specified how many layers to offload to the GPU.
You can use -ngl 99 to offload all layers to the GPU, or use gguf-parser to figure out how many layers should be offloaded given your environment.
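A minimal sketch of checking the offload estimate with gguf-parser before launching (the flag names below are assumptions from memory and may differ between versions; check gguf-parser --help):
# hypothetical invocation: print metadata and a memory/offload estimate for the first shard
gguf-parser --path /data/models/Qwen2.5-72B-Instruct-GGUF/qwen2.5-72b-instruct-q5_k_m-00001-of-00014.gguf --ctx-size 8192
Re-running with all layers offloaded: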
ASCEND_RT_VISIBLE_DEVICES=0 ./llama-box -c 8192 -np 4 --host 0.0.0.0 -ngl 99 -m ./qwen2.5-32b-instruct-q5_k_m.gguf
0.00.770.406 I
0.00.770.414 I version: v0.0.79 (8966164)
0.00.770.414 I compiler: cc (Ubuntu 11.4.0-2ubuntu1~20.04) 11.4.0
0.00.770.415 I target: aarch64-linux-gnu
0.00.770.415 I vendor:
0.00.770.416 I - llama.cpp 4a8ccb37 (395)
0.00.770.417 I - stable-diffusion.cpp ba589f6 (184)
0.00.772.821 I system_info: n_threads = 192 (n_threads_batch = 192) / 192 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
0.00.772.825 I
0.00.772.905 I srv main: listening, hostname = 0.0.0.0, port = 8080, n_threads = 6 + 2
0.00.774.116 I srv main: loading model
0.00.774.416 I llama_load_model_from_file: using device CANN0 (Ascend910B2) - 62131 MiB free
0.00.838.139 I llama_model_loader: loaded meta data with 29 key-value pairs and 771 tensors from ./qwen2.5-32b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
0.00.838.171 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.00.838.189 I llama_model_loader: - kv 0: general.architecture str = qwen2
0.00.838.192 I llama_model_loader: - kv 1: general.type str = model
0.00.838.194 I llama_model_loader: - kv 2: general.name str = qwen2.5-32b-instruct
0.00.838.195 I llama_model_loader: - kv 3: general.version str = v0.1
0.00.838.197 I llama_model_loader: - kv 4: general.finetune str = qwen2.5-32b-instruct
0.00.838.198 I llama_model_loader: - kv 5: general.size_label str = 33B
0.00.838.200 I llama_model_loader: - kv 6: qwen2.block_count u32 = 64
0.00.838.201 I llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
0.00.838.202 I llama_model_loader: - kv 8: qwen2.embedding_length u32 = 5120
0.00.838.203 I llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 27648
0.00.838.204 I llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 40
0.00.838.205 I llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 8
0.00.838.218 I llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
0.00.838.221 I llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
0.00.838.221 I llama_model_loader: - kv 14: general.file_type u32 = 17
0.00.838.222 I llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
0.00.838.223 I llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
0.00.868.594 I llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
0.00.875.404 I llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
0.00.904.922 I llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
0.00.904.931 I llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
0.00.904.932 I llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
0.00.904.933 I llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
0.00.904.935 I llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
0.00.904.939 I llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
0.00.904.940 I llama_model_loader: - kv 25: general.quantization_version u32 = 2
0.00.904.941 I llama_model_loader: - kv 26: split.no u16 = 0
0.00.904.942 I llama_model_loader: - kv 27: split.count u16 = 0
0.00.904.943 I llama_model_loader: - kv 28: split.tensors.count i32 = 771
0.00.904.944 I llama_model_loader: - type f32: 321 tensors
0.00.904.945 I llama_model_loader: - type q5_K: 385 tensors
0.00.904.946 I llama_model_loader: - type q6_K: 65 tensors
0.01.142.830 I llm_load_vocab: special tokens cache size = 22
0.01.215.395 I llm_load_vocab: token to piece cache size = 0.9310 MB
0.01.215.415 I llm_load_print_meta: format = GGUF V3 (latest)
0.01.215.415 I llm_load_print_meta: arch = qwen2
0.01.215.416 I llm_load_print_meta: vocab type = BPE
0.01.215.419 I llm_load_print_meta: n_vocab = 152064
0.01.215.420 I llm_load_print_meta: n_merges = 151387
0.01.215.420 I llm_load_print_meta: vocab_only = 0
0.01.215.421 I llm_load_print_meta: n_ctx_train = 131072
0.01.215.421 I llm_load_print_meta: n_embd = 5120
0.01.215.422 I llm_load_print_meta: n_layer = 64
0.01.215.437 I llm_load_print_meta: n_head = 40
0.01.215.440 I llm_load_print_meta: n_head_kv = 8
0.01.215.442 I llm_load_print_meta: n_rot = 128
0.01.215.444 I llm_load_print_meta: n_swa = 0
0.01.215.445 I llm_load_print_meta: n_embd_head_k = 128
0.01.215.445 I llm_load_print_meta: n_embd_head_v = 128
0.01.215.447 I llm_load_print_meta: n_gqa = 5
0.01.215.450 I llm_load_print_meta: n_embd_k_gqa = 1024
0.01.215.452 I llm_load_print_meta: n_embd_v_gqa = 1024
0.01.215.453 I llm_load_print_meta: f_norm_eps = 0.0e+00
0.01.215.455 I llm_load_print_meta: f_norm_rms_eps = 1.0e-06
0.01.215.456 I llm_load_print_meta: f_clamp_kqv = 0.0e+00
0.01.215.457 I llm_load_print_meta: f_max_alibi_bias = 0.0e+00
0.01.215.457 I llm_load_print_meta: f_logit_scale = 0.0e+00
0.01.215.460 I llm_load_print_meta: n_ff = 27648
0.01.215.460 I llm_load_print_meta: n_expert = 0
0.01.215.461 I llm_load_print_meta: n_expert_used = 0
0.01.215.461 I llm_load_print_meta: causal attn = 1
0.01.215.461 I llm_load_print_meta: pooling type = 0
0.01.215.462 I llm_load_print_meta: rope type = 2
0.01.215.463 I llm_load_print_meta: rope scaling = linear
0.01.215.467 I llm_load_print_meta: freq_base_train = 1000000.0
0.01.215.468 I llm_load_print_meta: freq_scale_train = 1
0.01.215.468 I llm_load_print_meta: n_ctx_orig_yarn = 131072
0.01.215.469 I llm_load_print_meta: rope_finetuned = unknown
0.01.215.469 I llm_load_print_meta: ssm_d_conv = 0
0.01.215.470 I llm_load_print_meta: ssm_d_inner = 0
0.01.215.471 I llm_load_print_meta: ssm_d_state = 0
0.01.215.471 I llm_load_print_meta: ssm_dt_rank = 0
0.01.215.472 I llm_load_print_meta: ssm_dt_b_c_rms = 0
0.01.215.473 I llm_load_print_meta: model type = ?B
0.01.215.475 I llm_load_print_meta: model ftype = Q5_K - Medium
0.01.215.477 I llm_load_print_meta: model params = 32.76 B
0.01.215.478 I llm_load_print_meta: model size = 21.66 GiB (5.68 BPW)
0.01.215.478 I llm_load_print_meta: general.name = qwen2.5-32b-instruct
0.01.215.479 I llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
0.01.215.480 I llm_load_print_meta: EOS token = 151645 '<|im_end|>'
0.01.215.480 I llm_load_print_meta: EOT token = 151645 '<|im_end|>'
0.01.215.481 I llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
0.01.215.482 I llm_load_print_meta: LF token = 148848 'ÄĬ'
0.01.215.482 I llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
0.01.215.483 I llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
0.01.215.484 I llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
0.01.215.484 I llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
0.01.215.485 I llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
0.01.215.485 I llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
0.01.215.486 I llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
0.01.215.487 I llm_load_print_meta: EOG token = 151645 '<|im_end|>'
0.01.215.487 I llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
0.01.215.488 I llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
0.01.215.488 I llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
0.01.215.489 I llm_load_print_meta: max token length = 256
0.02.789.838 I llm_load_tensors: offloading 64 repeating layers to GPU
0.02.789.845 I llm_load_tensors: offloading output layer to GPU
0.02.789.846 I llm_load_tensors: offloaded 65/65 layers to GPU
0.02.789.863 I llm_load_tensors: CPU_Mapped model buffer size = 22178.80 MiB
0.02.789.865 I llm_load_tensors: CANN0 model buffer size = 4.27 MiB
.................................................................................................
0.02.813.855 I llama_new_context_with_model: n_seq_max = 4
0.02.813.867 I llama_new_context_with_model: n_ctx = 8192
0.02.813.868 I llama_new_context_with_model: n_ctx_per_seq = 2048
0.02.813.869 I llama_new_context_with_model: n_batch = 2048
0.02.813.869 I llama_new_context_with_model: n_ubatch = 512
0.02.813.871 I llama_new_context_with_model: flash_attn = 0
0.02.813.903 I llama_new_context_with_model: freq_base = 1000000.0
0.02.813.909 I llama_new_context_with_model: freq_scale = 1
0.02.813.914 W llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
0.02.922.429 I llama_kv_cache_init: CANN0 KV buffer size = 2048.00 MiB
0.02.922.448 I llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
0.02.922.755 I llama_new_context_with_model: CANN_Host output buffer size = 2.32 MiB
0.02.936.736 I llama_new_context_with_model: CANN0 compute buffer size = 686.00 MiB
0.02.936.748 I llama_new_context_with_model: CANN_Host compute buffer size = 307.00 MiB
0.02.936.749 I llama_new_context_with_model: graph nodes = 2246
0.02.936.749 I llama_new_context_with_model: graph splits = 643
0.02.936.752 W common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.35.929.924 I srv main: initializing server
0.35.929.936 I srv init: initializing slots, n_slots = 4
0.35.929.939 I slot init: id 0 | task -1 | new slot n_ctx_slot = 2048
0.35.929.953 I slot init: id 1 | task -1 | new slot n_ctx_slot = 2048
0.35.930.017 I slot init: id 2 | task -1 | new slot n_ctx_slot = 2048
0.35.930.067 I slot init: id 3 | task -1 | new slot n_ctx_slot = 2048
0.35.931.327 I srv main: chat template, built_in: 1, chat_example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
0.35.931.333 I srv main: starting server
1.17.322.386 I srv oaicompat_completions_req: params: {"messages":"[...]","model":"qwen2","stream":false,"temperature":0}
1.17.325.968 I slot launch_slot_with_task: id 0 | task 0 | processing task, max_tps = N/A
1.17.325.987 I slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 10
Now I can see:
0.02.789.838 I llm_load_tensors: offloading 64 repeating layers to GPU
0.02.789.845 I llm_load_tensors: offloading output layer to GPU
0.02.789.846 I llm_load_tensors: offloaded 65/65 layers to GPU
But the curl request still hangs.
I found a documentation link that mentions: when deploying the full Qwen 2.5 series from ModelScope, CANN operator support is still incomplete; currently only FP16-precision models and Q8_0/Q4_0 quantizations can run, and running the FP16 model is recommended. So I suspect I need to re-quantize my model. 🙏
CANN only supports FP16/Q8/Q4; other quantization types will fall back to CPU computation.
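If you do need a CANN-native quantization, a minimal sketch using llama.cpp's llama-quantize tool, assuming you have the original FP16 GGUF to quantize from (file names below are placeholders):
# produce a Q8_0 file that CANN can run natively; quantizing from FP16 avoids the quality loss of re-quantizing the Q5_K file
llama-quantize qwen2.5-72b-instruct-fp16.gguf qwen2.5-72b-instruct-q8_0.gguf Q8_0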
If my model is split into multiple GGUF files, do the command-line arguments support that? For example, qwen2.5-32b-instruct-q5_k_m*.gguf consists of multiple files; must I merge them into a single file first?
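For reference, per the answer at the top, merging is not required: keep the shards in one directory and pass the first shard to -m. If you do want a single file, llama.cpp's gguf-split tool can merge the shards. A hedged sketch (the binary may be named gguf-split or llama-gguf-split depending on the build, and 000NN is a placeholder for the actual shard count):
# point llama-box at the first shard; the remaining shards in the same directory are picked up automatically
llama-box -c 8192 -np 4 -ngl 99 -m qwen2.5-32b-instruct-q5_k_m-00001-of-000NN.gguf
# or merge the shards into a single GGUF first
llama-gguf-split --merge qwen2.5-32b-instruct-q5_k_m-00001-of-000NN.gguf qwen2.5-32b-instruct-q5_k_m.gguf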