intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

vLLM CPU example load-in-low-bit is not used #11360

Open · noobHappylife opened this issue 3 months ago

noobHappylife commented 3 months ago

While testing the --load-in-low-bit feature with the vLLM CPU example, I noticed that the model is not optimized according to this option.

I found that the load_in_low_bit argument needs to be passed explicitly in api_server.py (as follows) for the model to be optimized with this option properly.

```python
engine = IPEXLLMAsyncLLMEngine.from_engine_args(
    engine_args, usage_context=UsageContext.OPENAI_API_SERVER,
    load_in_low_bit=args.load_in_low_bit)
```
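
For context, below is a minimal sketch of what the explicit wiring could look like in the CPU api_server.py. The import paths (ipex_llm.vllm.cpu.engine, vllm.engine.arg_utils, vllm.usage.usage_lib), the --load-in-low-bit flag definition, and the sym_int4 default are assumptions added for illustration; only the from_engine_args(..., load_in_low_bit=...) call above is taken from the report.

```python
# Illustrative sketch only: parse a --load-in-low-bit CLI flag and forward it
# to the IPEX-LLM engine so the low-bit optimization is actually applied.
# Import paths and the flag definition are assumptions, not the repo's code.
import argparse

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.usage.usage_lib import UsageContext
from ipex_llm.vllm.cpu.engine import IPEXLLMAsyncLLMEngine  # assumed module path

parser = argparse.ArgumentParser(description="Minimal OpenAI-compatible server sketch")
parser.add_argument(
    "--load-in-low-bit",
    type=str,
    default="sym_int4",  # assumed default; any low-bit format supported by IPEX-LLM
    help="Low-bit format used to optimize the model (e.g. sym_int4).",
)
parser = AsyncEngineArgs.add_cli_args(parser)
args = parser.parse_args()

engine_args = AsyncEngineArgs.from_cli_args(args)

# Without load_in_low_bit=... the engine falls back to its default and the
# requested low-bit optimization is silently ignored, which is the bug above.
engine = IPEXLLMAsyncLLMEngine.from_engine_args(
    engine_args,
    usage_context=UsageContext.OPENAI_API_SERVER,
    load_in_low_bit=args.load_in_low_bit,
)
```

With this wiring, a launch such as `python api_server.py --model <path> --load-in-low-bit sym_int4` (for example) would carry the chosen low-bit format through to the engine.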
xiangyuT commented 3 months ago

Hi @noobHappylife, thanks for pointing out the missing explicit load_in_low_bit value in vllm/cpu/entrypoints/openai/api_server.py. It will be fixed soon by this. I will update here once it is resolved.

noobHappylife commented 3 months ago

Thank you.