intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Low parallel requests on Arc with vLLM serving #11560

Open · jessie-zhao opened this issue 1 month ago

jessie-zhao commented 1 month ago

We get only 10 parallel requests on 2 Arc GPUs with the Qwen1.5 model (1024 input / 512 output tokens). Could you please improve the performance?
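For reference, a minimal sketch of how such a concurrency measurement can be reproduced against vLLM's OpenAI-compatible `/v1/completions` endpoint (which the ipex-llm vLLM server also exposes). The server address, model id, concurrency level, and prompt are placeholders, not the reporter's exact setup:

```python
# Concurrency probe: fire N completion requests at once and time them.
# Assumes a vLLM OpenAI-compatible server is already running locally.
import concurrent.futures
import time

import requests

URL = "http://localhost:8000/v1/completions"  # assumed server address
MODEL = "Qwen/Qwen1.5-7B-Chat"                # hypothetical model id

def one_request(i: int) -> float:
    """Send a single completion request and return its wall-clock latency."""
    payload = {
        "model": MODEL,
        "prompt": "hello " * 1024,  # rough stand-in for a ~1024-token input
        "max_tokens": 512,          # matches the 512-token output in the report
    }
    start = time.time()
    resp = requests.post(URL, json=payload, timeout=600)
    resp.raise_for_status()
    return time.time() - start

N = 16  # number of simultaneous requests to attempt
with concurrent.futures.ThreadPoolExecutor(max_workers=N) as pool:
    latencies = list(pool.map(one_request, range(N)))

print(f"{N} requests, avg latency {sum(latencies) / N:.1f}s, "
      f"max {max(latencies):.1f}s")
```

If the server is batching properly, latency should grow far more slowly than N; if requests are effectively serialized, max latency will scale roughly linearly with N.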

liu-shaojun commented 1 month ago

We will verify this in a 2-Arc-GPU environment.

liu-shaojun commented 1 month ago

We have verified on 2 Arc GPUs that it can handle more than 10 parallel requests.
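One knob worth checking when concurrency looks capped is vLLM's `--max-num-seqs`, which bounds how many sequences the scheduler keeps in flight per iteration. Below is a hedged launch sketch using stock vLLM flags (`--tensor-parallel-size`, `--max-num-seqs`, `--port` all exist in vLLM's OpenAI server); note that ipex-llm ships its own XPU serving entrypoint, so the exact module path and any XPU-specific flags should be taken from the ipex-llm vLLM serving docs. The model id and numbers are placeholders:

```python
# Launch a vLLM OpenAI-compatible server across 2 GPUs with a raised
# in-flight request cap. Treat the module path as the stock vLLM one;
# substitute ipex-llm's XPU entrypoint per its serving documentation.
import subprocess

cmd = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "Qwen/Qwen1.5-7B-Chat",   # hypothetical model id
    "--port", "8000",
    "--tensor-parallel-size", "2",       # shard the model across 2 GPUs
    "--max-num-seqs", "32",              # cap on concurrently scheduled sequences
]
subprocess.run(cmd, check=True)
```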