gpustack / llama-box

LM inference server implementation based on llama.cpp.
MIT License

"Huawei Ascend CANN 8.0" releases not support Ubuntu 22.04 #5

Closed: chinamerp closed this 2 days ago

chinamerp commented 2 weeks ago

As the title says.

thxCode commented 2 weeks ago

Please provide a detailed log, your OS information, or anything else you think is useful to help us debug.

chinamerp commented 2 weeks ago

Running inference on Qwen2.5-7B-Instruct-GGUF with the latest llama-box produces garbled output like the following: "中国首都GDPG排名G第一的城市G是G北京市G。北京GGGG是中国GGG的政治、文化、国际GGG经济中心G" (the expected Chinese answer about Beijing, interleaved with spurious "G" characters).

The environment is as follows:

```
Downloading Model to directory: /data/modals/cache/model_scope/hub/Qwen/Qwen2.5-7B-Instruct-GGUF
0.00.664.756 I
0.00.664.763 I version: v0.0.76 (c91b501)
0.00.664.764 I compiler: cc (Ubuntu 11.4.0-2ubuntu1~20.04) 11.4.0
0.00.664.764 I target: aarch64-linux-gnu
0.00.664.764 I vendor:
0.00.664.765 I - llama.cpp 54ef9cfc (367)
0.00.664.766 I - stable-diffusion.cpp abef683 (171)
0.00.666.833 I system_info: n_threads = 256 (n_threads_batch = 256) / 256 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
0.00.666.837 I
0.00.666.901 I srv main: listening, hostname = 0.0.0.0, port = 40255, n_threads = 6 + 2
0.00.668.175 I srv main: loading model
0.00.668.480 I llama_load_model_from_file: using device CANN0 (Ascend910B3) - 62078 MiB free
0.00.716.984 I llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /data/modals/cache/model_scope/Qwen/Qwen2___5-7B-Instruct-GGUF/qwen2.5-7b-instruct-fp16.gguf (version GGUF V3 (latest))
0.00.717.002 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.00.717.006 I llama_model_loader: - kv 0: general.architecture str = qwen2
0.00.717.008 I llama_model_loader: - kv 1: general.type str = model
0.00.717.010 I llama_model_loader: - kv 2: general.name str = qwen2.5-7b-instruct
0.00.717.011 I llama_model_loader: - kv 3: general.version str = v0.1
0.00.717.013 I llama_model_loader: - kv 4: general.finetune str = qwen2.5-7b-instruct
0.00.717.014 I llama_model_loader: - kv 5: general.size_label str = 7.6B
0.00.717.016 I llama_model_loader: - kv 6: qwen2.block_count u32 = 28
0.00.717.017 I llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
0.00.717.017 I llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584
0.00.717.018 I llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 18944
0.00.717.019 I llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28
0.00.717.020 I llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4
0.00.717.030 I llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
0.00.717.033 I llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
0.00.717.034 I llama_model_loader: - kv 14: general.file_type u32 = 1
0.00.717.035 I llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
0.00.717.036 I llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
0.00.743.531 I llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
0.00.749.367 I llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
0.00.775.297 I llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t", ...
0.00.775.306 I llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
0.00.775.306 I llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
0.00.775.307 I llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
0.00.775.308 I llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
0.00.775.312 I llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
0.00.775.313 I llama_model_loader: - kv 25: general.quantization_version u32 = 2
0.00.775.314 I llama_model_loader: - kv 26: split.no u16 = 0
0.00.775.315 I llama_model_loader: - kv 27: split.count u16 = 0
0.00.775.316 I llama_model_loader: - kv 28: split.tensors.count i32 = 339
0.00.775.317 I llama_model_loader: - type f32: 141 tensors
0.00.775.318 I llama_model_loader: - type f16: 198 tensors
0.00.982.202 I llm_load_vocab: special tokens cache size = 22
0.01.047.916 I llm_load_vocab: token to piece cache size = 0.9310 MB
0.01.047.943 I llm_load_print_meta: format = GGUF V3 (latest)
0.01.047.944 I llm_load_print_meta: arch = qwen2
0.01.047.945 I llm_load_print_meta: vocab type = BPE
0.01.047.946 I llm_load_print_meta: n_vocab = 152064
0.01.047.947 I llm_load_print_meta: n_merges = 151387
0.01.047.947 I llm_load_print_meta: vocab_only = 0
0.01.047.948 I llm_load_print_meta: n_ctx_train = 131072
0.01.047.948 I llm_load_print_meta: n_embd = 3584
0.01.047.948 I llm_load_print_meta: n_layer = 28
0.01.047.963 I llm_load_print_meta: n_head = 28
0.01.047.965 I llm_load_print_meta: n_head_kv = 4
0.01.047.966 I llm_load_print_meta: n_rot = 128
0.01.047.967 I llm_load_print_meta: n_swa = 0
0.01.047.968 I llm_load_print_meta: n_embd_head_k = 128
0.01.047.969 I llm_load_print_meta: n_embd_head_v = 128
0.01.047.971 I llm_load_print_meta: n_gqa = 7
0.01.047.972 I llm_load_print_meta: n_embd_k_gqa = 512
0.01.047.975 I llm_load_print_meta: n_embd_v_gqa = 512
0.01.047.976 I llm_load_print_meta: f_norm_eps = 0.0e+00
0.01.047.977 I llm_load_print_meta: f_norm_rms_eps = 1.0e-06
0.01.047.978 I llm_load_print_meta: f_clamp_kqv = 0.0e+00
0.01.047.979 I llm_load_print_meta: f_max_alibi_bias = 0.0e+00
0.01.047.981 I llm_load_print_meta: f_logit_scale = 0.0e+00
0.01.047.984 I llm_load_print_meta: n_ff = 18944
0.01.047.985 I llm_load_print_meta: n_expert = 0
0.01.047.985 I llm_load_print_meta: n_expert_used = 0
0.01.047.986 I llm_load_print_meta: causal attn = 1
0.01.047.986 I llm_load_print_meta: pooling type = 0
0.01.047.987 I llm_load_print_meta: rope type = 2
0.01.047.988 I llm_load_print_meta: rope scaling = linear
0.01.047.990 I llm_load_print_meta: freq_base_train = 1000000.0
0.01.047.991 I llm_load_print_meta: freq_scale_train = 1
0.01.047.992 I llm_load_print_meta: n_ctx_orig_yarn = 131072
0.01.047.993 I llm_load_print_meta: rope_finetuned = unknown
0.01.047.994 I llm_load_print_meta: ssm_d_conv = 0
0.01.047.994 I llm_load_print_meta: ssm_d_inner = 0
0.01.047.994 I llm_load_print_meta: ssm_d_state = 0
0.01.047.995 I llm_load_print_meta: ssm_dt_rank = 0
0.01.047.995 I llm_load_print_meta: ssm_dt_b_c_rms = 0
0.01.047.996 I llm_load_print_meta: model type = 7B
0.01.047.998 I llm_load_print_meta: model ftype = F16
0.01.047.999 I llm_load_print_meta: model params = 7.62 B
0.01.048.001 I llm_load_print_meta: model size = 14.19 GiB (16.00 BPW)
0.01.048.002 I llm_load_print_meta: general.name = qwen2.5-7b-instruct
0.01.048.003 I llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
0.01.048.004 I llm_load_print_meta: EOS token = 151645 '<|im_end|>'
0.01.048.005 I llm_load_print_meta: EOT token = 151645 '<|im_end|>'
0.01.048.005 I llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
0.01.048.006 I llm_load_print_meta: LF token = 148848 'ÄĬ'
0.01.048.007 I llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
0.01.048.007 I llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
0.01.048.008 I llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
0.01.048.009 I llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
0.01.048.010 I llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
0.01.048.010 I llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
0.01.048.011 I llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
0.01.048.012 I llm_load_print_meta: EOG token = 151645 '<|im_end|>'
0.01.048.012 I llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
0.01.048.014 I llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
0.01.048.015 I llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
0.01.048.015 I llm_load_print_meta: max token length = 256
0.01.075.453 I llm_load_tensors: offloading 28 repeating layers to GPU
0.01.075.460 I llm_load_tensors: offloading output layer to GPU
0.01.075.461 I llm_load_tensors: offloaded 29/29 layers to GPU
0.01.075.469 I llm_load_tensors: CANN_Host model buffer size = 1039.50 MiB
0.01.075.470 I llm_load_tensors: CANN0 model buffer size = 13486.77 MiB
........................................................................................
0.05.253.347 I llama_new_context_with_model: n_seq_max = 4
0.05.253.356 I llama_new_context_with_model: n_ctx = 8192
0.05.253.357 I llama_new_context_with_model: n_ctx_per_seq = 2048
0.05.253.357 I llama_new_context_with_model: n_batch = 2048
0.05.253.357 I llama_new_context_with_model: n_ubatch = 512
0.05.253.359 I llama_new_context_with_model: flash_attn = 0
0.05.253.369 I llama_new_context_with_model: freq_base = 1000000.0
0.05.253.371 I llama_new_context_with_model: freq_scale = 1
0.05.253.373 W llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
0.05.277.347 I llama_kv_cache_init: CANN0 KV buffer size = 448.00 MiB
0.05.277.355 I llama_new_context_with_model: KV self size = 448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB
0.05.277.401 I llama_new_context_with_model: CANN_Host output buffer size = 0.05 MiB
0.05.281.827 I llama_new_context_with_model: CANN0 compute buffer size = 492.00 MiB
0.05.281.835 I llama_new_context_with_model: CANN_Host compute buffer size = 23.01 MiB
0.05.281.836 I llama_new_context_with_model: graph nodes = 986
0.05.281.837 I llama_new_context_with_model: graph splits = 2
0.05.281.839 W common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.06.138.322 I srv main: initializing server
0.06.138.332 I srv init: initializing slots, n_slots = 4
0.06.138.335 I slot init: id 0 | task -1 | new slot n_ctx_slot = 2048
0.06.138.345 I slot init: id 1 | task -1 | new slot n_ctx_slot = 2048
0.06.138.348 I slot init: id 2 | task -1 | new slot n_ctx_slot = 2048
0.06.138.351 I slot init: id 3 | task -1 | new slot n_ctx_slot = 2048
0.06.138.544 I srv main: chat template, built_in: 1, chat_example: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
0.06.138.544 I srv main: starting server
```
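
For reference, the garbled output can be reproduced with a plain chat request against the server above (a minimal sketch, assuming llama-box exposes the llama.cpp-style OpenAI-compatible /v1/chat/completions endpoint; the port is taken from the startup log and the prompt is only illustrative):

```bash
# Minimal reproduction sketch. Assumes llama-box serves the OpenAI-compatible
# /v1/chat/completions endpoint, as llama.cpp's server does; port 40255 is
# the one shown in the startup log, and the prompt is only illustrative.
curl http://localhost:40255/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "中国的首都是哪个城市?"}
        ]
      }'
```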

thxCode commented 2 weeks ago

Can you try the artifact built from the main branch in this action run: https://github.com/gpustack/llama-box/actions/runs/11832443367/job/32969280706?

Here is the download: https://github.com/gpustack/llama-box/actions/runs/11832443367/artifacts/2186204101.
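
Verifying the artifact would look roughly like this (a sketch only; the zip name below is hypothetical, so substitute the actual file downloaded from the link above):

```bash
# Hypothetical archive name; use the actual zip downloaded from the link above.
# The binary inside is assumed to be named llama-box, and --version is assumed
# to print the same build info the server logs at startup (e.g. v0.0.76).
unzip llama-box-artifact.zip -d llama-box-main
chmod +x llama-box-main/llama-box
./llama-box-main/llama-box --version
```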

thxCode commented 2 weeks ago

Device: Ascend 910B3 (screenshot not preserved)

Test results (screenshots not preserved):

- 0.5B-FP16 -> NORMAL
- 3B-FP16 -> NORMAL
- 7B-FP16 -> ABNORMAL
- 7B-FP16 (with flash attention) -> NORMAL

thxCode commented 2 days ago

@chinamerp you can try with `-fa`. Please reopen if there are any new problems.
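
For reference, enabling it would look roughly like this (a sketch only: `-fa` is the flag suggested above, and the remaining options merely mirror the values seen in the startup log, so treat them as assumptions about this particular deployment rather than requirements):

```bash
# Sketch of a llama-box launch with flash attention enabled (-fa). The other
# flags mirror the startup log above (port 40255, n_ctx 8192, 4 slots, full
# layer offload) and are assumptions about this deployment, not requirements.
llama-box \
  --host 0.0.0.0 --port 40255 \
  -m /data/modals/cache/model_scope/Qwen/Qwen2___5-7B-Instruct-GGUF/qwen2.5-7b-instruct-fp16.gguf \
  -c 8192 -np 4 -ngl 29 \
  -fa
```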