Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0
Running vLLM service benchmark(4xARC770) with Qwen1.5-32B-Chat model failed #11956
It seems that the gpu-memory-utilization is too high, causing card 1 to OOM when the first token is computed. You can reduce it to 0.85 and try again.
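The suggested fix can be sketched as a launch command. This is a minimal example using vLLM's standard OpenAI-compatible server flags (`--gpu-memory-utilization` and `--tensor-parallel-size` are real vLLM options); the exact entrypoint module and extra flags inside the intelanalytics/ipex-llm-serving-xpu container may differ, so treat the module path below as an assumption and adapt it to the serving script actually used:

```shell
# Hypothetical sketch: serve Qwen1.5-32B-Chat across 4 GPUs with a lower
# memory-utilization cap, per the suggestion above. The entrypoint module
# shown is standard vLLM; the ipex-llm container may use its own wrapper.
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen1.5-32B-Chat \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.85   # lowered from the default 0.9 to avoid OOM
```

Lowering `--gpu-memory-utilization` shrinks the KV-cache pool vLLM pre-allocates on each card, leaving headroom for the activation memory needed when the first token of a long prompt is computed.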
Environment:
- Platform: 6548N + 4x Arc A770
- Docker image: intelanalytics/ipex-llm-serving-xpu:2.1.0
- Serving script:
Error info:
1. With dtype SYM_INT4 it succeeds.
2. With dtype FP8 it fails at concurrency >= 4; no error at concurrency 1 or 2.
3. GPU card 0 shows N/A utilization, while cards 1, 2, and 3 work well:
4. Serving-side error log:
5. Client error info: