Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0
Unexpected output when running inference on Qwen-7B-Chat-10-12 with 1024-128 in_out_pairs using the transformer_int4_gpu API #9351
@hkvision When I tested the Qwen-7B-Chat-10-1t model with 1024-128 in_out_pairs using the transformer_int4_gpu API, the output was about 41, which didn't meet expectations. Please take a look.
Env:
bigdl-core-xe-2.4.0b20231102
bigdl-llm-2.4.0b20231102
intel-extension-for-pytorch-2.0.110+xpu
torch-2.0.1a0+cxx11.abi
torchvision-0.15.2a0+cxx11.abi
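For context, the transformer_int4_gpu path in the all-in-one benchmark roughly corresponds to loading the model with bigdl-llm's transformers-style API in INT4 and running generation on an Intel GPU. Below is a minimal sketch of that flow, not the actual benchmark script; the model path, prompt, and token counts are illustrative assumptions.

```python
# Minimal sketch: INT4 inference of Qwen-7B-Chat on an Intel GPU with bigdl-llm.
# Assumptions: model_path points to a local/HF Qwen-7B-Chat checkpoint, and the
# real benchmark uses a fixed ~1024-token prompt with 128 new tokens.
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen-7B-Chat"  # assumed path for illustration

# load_in_4bit=True applies bigdl-llm's transformer INT4 optimization
model = AutoModelForCausalLM.from_pretrained(
    model_path, load_in_4bit=True, trust_remote_code=True
)
model = model.to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "Once upon a time"  # the benchmark pads this to roughly 1024 tokens
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=128)  # 128-token output
print(tokenizer.decode(output[0], skip_special_tokens=True))
```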