intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Running vLLM service benchmark (4xARC770) with Qwen1.5-32B-Chat model failed #11956

Open dukelee111 opened 1 month ago

dukelee111 commented 1 month ago

Environment:

- Platform: 6548N + 4x ARC770
- Docker image: intelanalytics/ipex-llm-serving-xpu:2.1.0
- Serving script: (attached as a screenshot; a hypothetical reconstruction follows below)
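Since the actual serving script is only available as a screenshot, the following is a minimal sketch of what a 4-card launch typically looks like. The entrypoint module, model path, and flag values are assumptions based on common ipex-llm vLLM serving setups, not the reporter's actual configuration.

```python
# Hypothetical reconstruction of the serving launch; all paths and values are assumptions.
import subprocess

serve_cmd = [
    "python", "-m", "ipex_llm.vllm.xpu.entrypoints.openai.api_server",  # assumed entrypoint
    "--model", "/llm/models/Qwen1.5-32B-Chat",        # assumed local model path
    "--served-model-name", "Qwen1.5-32B-Chat",
    "--port", "8000",
    "--device", "xpu",
    "--dtype", "float16",
    "--load-in-low-bit", "fp8",                       # the quantization dtype that fails
    "--tensor-parallel-size", "4",                    # spread the 32B model across the 4 Arc 770 cards
    "--gpu-memory-utilization", "0.90",               # assumed value in use before the suggested fix
    "--max-model-len", "2048",
]
subprocess.run(serve_cmd, check=True)
```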

Error info:

1. With dtype SYM_INT4 the benchmark succeeds.
2. With dtype FP8 it fails at concurrency >= 4; there is no error at concurrency 1 and 2 (a reproduction sketch follows below).
3. GPU card 0 shows N/A utilization while cards 1, 2 and 3 work normally: (screenshot)
4. Serving-side error log: (screenshot)
5. Client error info: (screenshot)
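For reference, a minimal concurrent client along these lines should reproduce the concurrency-dependent behavior. The endpoint URL, model name, and payload are assumptions, since the reporter's benchmark client is only shown in a screenshot.

```python
# Minimal sketch of a concurrent benchmark client; endpoint, model name and payload are assumptions.
import concurrent.futures
import requests

URL = "http://localhost:8000/v1/completions"   # assumed OpenAI-compatible endpoint
PAYLOAD = {
    "model": "Qwen1.5-32B-Chat",
    "prompt": "Explain the benefits of FP8 quantization in one paragraph.",
    "max_tokens": 128,
}

def one_request(i: int) -> str:
    resp = requests.post(URL, json=PAYLOAD, timeout=300)
    resp.raise_for_status()
    return f"request {i}: HTTP {resp.status_code}"

# Concurrency 1 and 2 reportedly succeed; 4 concurrent requests trigger the failure.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(one_request, range(4)):
        print(result)
```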

hzjane commented 1 month ago

It seems that the gpu-memory-utilization is too high, causing card 1 to run out of memory when the first token is computed. You can reduce it to 0.85 and try again.
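For illustration, a rough headroom calculation shows why a lower utilization fraction can avoid the first-token OOM; the 16 GB per-card figure matches the Arc A770, but the arithmetic is only illustrative, not measured from this setup.

```python
# Illustrative headroom math for lowering --gpu-memory-utilization from 0.90 to 0.85.
card_mem_gib = 16.0                       # Arc A770 memory per card (assumed)
for util in (0.90, 0.85):
    reserved = card_mem_gib * util        # memory vLLM pre-allocates (weights + KV cache)
    headroom = card_mem_gib - reserved    # left for prefill activations, runtime overhead, etc.
    print(f"gpu-memory-utilization={util}: reserved {reserved:.1f} GiB, headroom {headroom:.1f} GiB")
```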