microsoft / T-MAC

Low-bit LLM inference on CPU with lookup table
MIT License

【Qwen】Could you please update 3rdparty/llama.cpp to support Qwen1.5 or Qwen2? #27

tiger-of-shawn opened this issue 3 weeks ago

tiger-of-shawn commented 3 weeks ago

```
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 2854 (70c312d)
main: built with clang version 17.0.6 (http://git.linaro.org/toolchain/jenkins-scripts.git 09f505cadfbe9987730e641398ab9a2ca0cdb67f) for aarch64-unknown-linux-gnu
main: seed = 1724291253
[09:47:33] T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init
llama_model_loader: loaded meta data with 20 key-value pairs and 387 tensors from models/Qwen1.5-1.8B-Chat-GPTQ-Int4/ggml-model.in.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv  0: general.architecture str = qwen2
llama_model_loader: - kv  1: general.name str = Qwen1.5-1.8B-Chat-GPTQ-Int4
llama_model_loader: - kv  2: qwen2.block_count u32 = 24
llama_model_loader: - kv  3: qwen2.context_length u32 = 32768
llama_model_loader: - kv  4: qwen2.embedding_length u32 = 2048
llama_model_loader: - kv  5: qwen2.feed_forward_length u32 = 5504
llama_model_loader: - kv  6: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv  7: qwen2.attention.head_count_kv u32 = 16
llama_model_loader: - kv  8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv  9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 32
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - type f32: 217 tensors
llama_model_loader: - type f16: 2 tensors
llama_model_loader: - type i4: 168 tensors
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error:
```
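The failure is in vocabulary loading, not in the T-MAC kernels: the GGUF carries `tokenizer.ggml.pre = qwen2` (kv 12 above), but the llama.cpp snapshot bundled in `3rdparty/llama.cpp` does not know this pre-tokenizer yet, so the loader falls through to its catch-all branch and throws. The sketch below paraphrases the pre-tokenizer dispatch in upstream llama.cpp's `llm_load_vocab()`; treat the branch list as illustrative, since the bundled snapshot may differ:

```cpp
// Sketch (paraphrased from upstream llama.cpp) of the pre-tokenizer dispatch
// in llm_load_vocab(). The bundled snapshot lacks the "qwen2" branch, so a
// Qwen1.5/Qwen2 GGUF reaches the final else and produces the error above.
if (tokenizer_pre.empty() || tokenizer_pre == "default") {
    vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
} else if (tokenizer_pre == "llama-bpe") {
    vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_LLAMA3;
} else if (tokenizer_pre == "qwen2") {
    // Added in newer upstream llama.cpp together with the matching BPE split
    // regex for Qwen's tokenizer; rebasing 3rdparty/llama.cpp brings this in.
    vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_QWEN2;
} else {
    throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));
}
```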

FranzKafkaYu commented 3 weeks ago

+1

kaleid-liner commented 3 weeks ago

See #24. I'm working on it, but it won't be very quick.
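If you need to smoke-test a Qwen GGUF before the rebase lands, one possible stopgap (an unofficial sketch against the bundled `3rdparty/llama.cpp`, assuming its throw site matches the upstream one sketched above) is to downgrade the hard error to a warning and fall back to the default pre-tokenizer. Qwen2's BPE split regex differs from the default one, so tokenization may be subtly wrong; don't trust quality numbers produced this way:

```diff
     } else {
-        throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));
+        // stopgap only: tokenize unknown models with the default BPE splitter
+        LLAMA_LOG_WARN("%s: unknown pre-tokenizer type '%s', falling back to 'default'\n",
+                __func__, tokenizer_pre.c_str());
+        vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
     }
```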

zcxo commented 3 weeks ago

+1

zcxo commented 3 weeks ago

I strongly recommend that the Qwen team and the T-MAC team collaborate with USTC to accelerate the end-to-end implementation and universal adoption of T-MAC and Qwen!

tiger-of-shawn commented 3 weeks ago

> I strongly recommend that the Qwen team and the T-MAC team collaborate with USTC to accelerate the end-to-end implementation and universal adoption of T-MAC and Qwen!

Currently, we have achieved high-performance operation of Qwen2 0.5B/1.5B on Qualcomm/MTK/Intel chips, using the NPU and GPU, on mobile phones, in-vehicle systems, and PCs. Considering power consumption issues, we do not currently recommend using CPU for backend acceleration on the edge. However, we are evaluating the actual performance of T-MAC.

FranzKafkaYu commented 3 weeks ago

> I strongly recommend that the Qwen team and the T-MAC team collaborate with USTC to accelerate the end-to-end implementation and universal adoption of T-MAC and Qwen!
>
> Currently, we have achieved high-performance operation of Qwen2 0.5B/1.5B on Qualcomm/MTK/Intel chips, using the NPU and GPU, on mobile phones, in-vehicle systems, and PCs. Considering power consumption issues, we do not currently recommend using CPU for backend acceleration on the edge. However, we are evaluating the actual performance of T-MAC.

Recently I have been working on deploying the Qwen2 0.5B model on an MTK SoC using the MTK NeuroPilot SDK. Do you use their SDK? Can you share some details, such as benchmark results? Thanks!

tiger-of-shawn commented 3 weeks ago

> I strongly recommend that the Qwen team and the T-MAC team collaborate with USTC to accelerate the end-to-end implementation and universal adoption of T-MAC and Qwen!
>
> Currently, we have achieved high-performance operation of Qwen2 0.5B/1.5B on Qualcomm/MTK/Intel chips, using the NPU and GPU, on mobile phones, in-vehicle systems, and PCs. Considering power consumption issues, we do not currently recommend using CPU for backend acceleration on the edge. However, we are evaluating the actual performance of T-MAC.
>
> Recently I have been working on deploying the Qwen2 0.5B model on an MTK SoC using the MTK NeuroPilot SDK. Do you use their SDK? Can you share some details, such as benchmark results? Thanks!

Yes, we used APU/MNN to run Qwen2 LLM on MTK 9300/9000. However, we are the commercialization team for Qwen on device, and data such as performance metrics is part of our commercial delivery, so we cannot disclose it at this time. Sorry for the inconvenience.

caoshijie0501 commented 3 weeks ago

> I strongly recommend that the Qwen team and the T-MAC team collaborate with USTC to accelerate the end-to-end implementation and universal adoption of T-MAC and Qwen!
>
> Currently, we have achieved high-performance operation of Qwen2 0.5B/1.5B on Qualcomm/MTK/Intel chips, using the NPU and GPU, on mobile phones, in-vehicle systems, and PCs. Considering power consumption issues, we do not currently recommend using CPU for backend acceleration on the edge. However, we are evaluating the actual performance of T-MAC.

Great to know this! Will you share the results or open-source the implementation? BTW, are the GEMM/GEMV ops of the 0.5B/1.5B models in int8 or fp16?

tiger-of-shawn commented 3 weeks ago

> I strongly recommend that the Qwen team and the T-MAC team collaborate with USTC to accelerate the end-to-end implementation and universal adoption of T-MAC and Qwen!
>
> Currently, we have achieved high-performance operation of Qwen2 0.5B/1.5B on Qualcomm/MTK/Intel chips, using the NPU and GPU, on mobile phones, in-vehicle systems, and PCs. Considering power consumption issues, we do not currently recommend using CPU for backend acceleration on the edge. However, we are evaluating the actual performance of T-MAC.
>
> Great to know this! Will you share the results or open-source the implementation? BTW, are the GEMM/GEMV ops of the 0.5B/1.5B models in int8 or fp16?

Currently, this information can only be obtained through Alibaba Cloud support tickets. Sorry for the inconvenience.

caoshijie0501 commented 3 weeks ago

> I strongly recommend that the Qwen team and the T-MAC team collaborate with USTC to accelerate the end-to-end implementation and universal adoption of T-MAC and Qwen!
>
> Currently, we have achieved high-performance operation of Qwen2 0.5B/1.5B on Qualcomm/MTK/Intel chips, using the NPU and GPU, on mobile phones, in-vehicle systems, and PCs. Considering power consumption issues, we do not currently recommend using CPU for backend acceleration on the edge. However, we are evaluating the actual performance of T-MAC.
>
> Great to know this! Will you share the results or open-source the implementation? BTW, are the GEMM/GEMV ops of the 0.5B/1.5B models in int8 or fp16?
>
> Currently, this information can only be obtained through Alibaba Cloud support tickets. Sorry for the inconvenience.

If you don't recommend using the CPU, why did you create an issue asking us to support Qwen? Maybe you are experts in power consumption; please share the real numbers and we would be happy to learn from you. I hope you can open-source your high-performance implementation in the future and really "benefit everyone" (普惠大众). BTW, how do we create an Alibaba Cloud support ticket? Does it mean paying money? We would like to try it and learn from you guys.

tiger-of-shawn commented 3 weeks ago

> I strongly recommend that the Qwen team and the T-MAC team collaborate with USTC to accelerate the end-to-end implementation and universal adoption of T-MAC and Qwen!
>
> Currently, we have achieved high-performance operation of Qwen2 0.5B/1.5B on Qualcomm/MTK/Intel chips, using the NPU and GPU, on mobile phones, in-vehicle systems, and PCs. Considering power consumption issues, we do not currently recommend using CPU for backend acceleration on the edge. However, we are evaluating the actual performance of T-MAC.
>
> Great to know this! Will you share the results or open-source the implementation? BTW, are the GEMM/GEMV ops of the 0.5B/1.5B models in int8 or fp16?
>
> Currently, this information can only be obtained through Alibaba Cloud support tickets. Sorry for the inconvenience.
>
> If you don't recommend using the CPU, why did you create an issue asking us to support Qwen? Maybe you are experts in power consumption; please share the real numbers and we would be happy to learn from you. I hope you can open-source your high-performance implementation in the future and really "benefit everyone" (普惠大众). BTW, how do we create an Alibaba Cloud support ticket? Does it mean paying money? We would like to try it and learn from you guys.

  1. "Considering power consumption issues, we do not currently recommend using CPU for backend acceleration on the edge. ", so I'm trying T-MAC on Phone/PC.
  2. "alibaba cloud support ticket": please refer to https://smartservice.console.aliyun.com/service/create-ticket , it's one way to get detail information about "Qwen on device" , applying for a ticket is free.
Starrylun commented 2 weeks ago

+1

zcxo commented 1 week ago

Hi tiger-of-shawn:

> "Considering power consumption issues, we do not currently recommend using CPU for backend acceleration on the edge." That is why I'm trying T-MAC on phones/PCs.

I don't know if you have completed the experiment yet. Can you share your evaluation of T-MAC running on mobile phones? It would be quite helpful for our use of the Qwen model. Thank you!