intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

6K input OOM on ARC with VLLM-serving #11561

Open jessie-zhao opened 1 month ago

jessie-zhao commented 1 month ago

Faced OOM on Arc with 6K input / 512 output with vLLM serving. Models: ChatGLM3-6B, Qwen1.5-32B on 4 Arc GPUs.
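For reference, a request hitting this setup would look roughly like the sketch below, sent against the OpenAI-compatible endpoint that vLLM serving exposes. The base URL, served model name, and prompt are placeholders (assumptions), not taken from the actual deployment:

```python
from openai import OpenAI

# Placeholder endpoint and API key; substitute the actual vLLM serving address.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Roughly 6K tokens of input context (placeholder content).
long_prompt = "Summarize the following document:\n" + "lorem ipsum " * 3000

resp = client.completions.create(
    model="Qwen1.5-32B",   # or "ChatGLM3-6B", depending on which model is served
    prompt=long_prompt,
    max_tokens=512,        # 512 output tokens, as in the report
    temperature=0.0,
)
print(resp.choices[0].text)
```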

hzjane commented 1 month ago

We have verified 6K input / 512 output with vLLM serving: ChatGLM3-6B on 2 Arc GPUs and Qwen1.5-32B on 4 Arc GPUs.
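Whether a long prompt fits is mainly governed by the engine's context-length and KV-cache settings. Below is a hedged sketch using the standard vLLM Python API to show the knobs involved (tensor_parallel_size, max_model_len, gpu_memory_utilization); the exact entrypoint and device options for ipex-llm's XPU/Arc build may differ, so treat the import path and the values as assumptions rather than the verified configuration:

```python
from vllm import LLM, SamplingParams

# Assumption: ipex-llm's XPU integration wraps vLLM's LLM class; the plain
# vLLM constructor is shown here for illustration only.
llm = LLM(
    model="Qwen/Qwen1.5-32B-Chat",   # placeholder model path
    tensor_parallel_size=4,          # split the 32B model across 4 Arc GPUs
    max_model_len=6656,              # must cover ~6K input + 512 output tokens
    gpu_memory_utilization=0.90,     # fraction of GPU memory for weights + KV cache
)

params = SamplingParams(max_tokens=512, temperature=0.0)
outputs = llm.generate(["<~6K-token prompt goes here>"], params)
print(outputs[0].outputs[0].text)
```

If the 6K-input case still OOMs, the usual levers are reducing max_model_len, lowering max_num_seqs (fewer concurrent sequences sharing the KV cache), or increasing the tensor-parallel degree.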