intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

OOM on multiple-ARC with vllm serving #11559

Open jessie-zhao opened 1 month ago

jessie-zhao commented 1 month ago

Running the vLLM serving test on Arc fails with the following error:

INFO 07-04 19:10:08 async_llm_engine.py:152] Aborted request cmpl-e5fb5cad96e9402dabbbece3611ae22f-0.
INFO:     127.0.0.1:41772 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]

Model: ChatGLM3-6B and Qwen1.5-14B on 2 Arc GPUs; Qwen1.5-32B on 4 Arc GPUs. HW: Xeon W + Arc
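For context, the 500 error above is returned to an OpenAI-compatible completions request once the engine aborts the request. A minimal sketch of such a request is below; the server address, port, and payload values are illustrative assumptions, not taken from the issue.

import requests

# Hypothetical endpoint: the actual host/port of the serving container is not shown in the issue.
BASE_URL = "http://127.0.0.1:8000"

payload = {
    "model": "Qwen1.5-14B",   # one of the models listed above; name is illustrative
    "prompt": "What is AI?",
    "max_tokens": 128,
}

# In the log, this POST to /v1/completions comes back as 500 Internal Server Error
# after the async engine aborts the request (OOM on the Arc cards).
resp = requests.post(f"{BASE_URL}/v1/completions", json=payload, timeout=60)
print(resp.status_code)
print(resp.text)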

gc-fu commented 1 month ago

Hi, we have fixed this issue in our new images. You can pull the new image and give it a try~
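For reference, a minimal sketch of the multi-Arc setup being served, assuming the ipex-llm vLLM fork keeps the upstream LLM/SamplingParams interface; the model path, tensor_parallel_size, and memory settings are illustrative and not taken from the issue.

from vllm import LLM, SamplingParams  # import path assumed unchanged in the ipex-llm vLLM fork

# Illustrative only: sharding a 14B model across 2 Arc cards.
# Lowering gpu_memory_utilization or max_model_len reduces KV-cache memory pressure,
# which is a common knob when hitting OOM; it is not the fix described above.
llm = LLM(
    model="/models/Qwen1.5-14B-Chat",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.85,
    max_model_len=2048,
)

outputs = llm.generate(["What is AI?"], SamplingParams(max_tokens=128))
for out in outputs:
    print(out.outputs[0].text)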