Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0
vLLM CPU example load-in-low-bit is not used #11360
While testing the --load-in-low-bit feature with the vLLM CPU example, I noticed that the model is not optimized according to this option.
I found that the load_in_low_bit argument needs to be passed explicitly in api_server.py (as follows) for the model to be optimized with the option properly.
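The issue boils down to a CLI flag being parsed but never forwarded to the engine that loads the model. Below is a minimal, hypothetical sketch of that failure mode; the function and dictionary names are illustrative and do not reflect the actual ipex-llm or vLLM API.

```python
# Hypothetical sketch of the bug pattern (names are illustrative,
# not the real ipex-llm/vLLM code): a flag that is parsed but not
# forwarded has no effect on how the model is loaded.
import argparse


def build_engine_args(argv, forward_low_bit=True):
    """Parse server CLI flags and build the engine argument dict."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--load-in-low-bit", default="sym_int4")
    args = parser.parse_args(argv)

    engine_args = {"model": "some-model"}
    if forward_low_bit:
        # The fix: pass the parsed value through to the engine so the
        # model is actually quantized/optimized as requested.
        engine_args["load_in_low_bit"] = args.load_in_low_bit
    # Before the fix, the flag was accepted here but silently dropped,
    # so the model loaded with default (unoptimized) weights.
    return engine_args


# With forwarding, the option reaches the engine arguments:
print(build_engine_args(["--load-in-low-bit", "sym_int4"]))
```

The point of the report is exactly this gap: the server accepts `--load-in-low-bit`, but unless the parsed value is handed to the model-loading path in api_server.py, the option has no effect.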
Hi @noobHappylife,
Thanks for pointing out the problem of load_in_low_bit not being passed explicitly in vllm/cpu/entrypoints/openai/api_server.py. It will be fixed soon by this. We will update here once it is resolved.