intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Provide support for model serving using FastAPI with DeepSpeed + ipex-llm #10690

Open nazneenn opened 5 months ago

nazneenn commented 5 months ago

Hi, could you please provide a guide on using the DeepSpeed multi-GPU approach (on Intel Flex 140 GPUs) to run model inference behind a FastAPI and uvicorn setup? Model ID: 'meta-llama/Llama-2-7b-chat-hf'. Thanks!

glorysdj commented 5 months ago

Hi @nazneenn, we are developing a PoC of FastAPI serving using multi-GPU and will keep you updated.

digitalscream commented 5 months ago

Watching this one - I'll be aiming to run Mixtral 8x7B AWQ on a pair of Arc A770s (I'll buy the second GPU as soon as I know it's supported).

glorysdj commented 5 months ago

Hi @nazneenn @digitalscream, FastAPI serving using multi-GPU is now supported in ipex-llm. Please refer to this example: https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Deepspeed-AutoTP-FastAPI
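
At a high level, the linked example follows the pattern sketched below: load the model on CPU, shard it with DeepSpeed AutoTP, apply ipex-llm low-bit optimization to each rank's shard, move the shards to the XPU devices, and have rank 0 expose the FastAPI endpoint. This is only a simplified sketch, not the actual serving script from the example: it assumes a launcher such as mpirun sets `LOCAL_RANK`/`WORLD_SIZE`, that oneCCL bindings and DeepSpeed's XPU support are installed, and the `/generate` endpoint, the `sym_int4` low-bit format, and the rank-0 broadcast loop are illustrative choices.

```python
import os

import deepspeed
import torch
import torch.distributed as dist
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the 'xpu' device)
from transformers import AutoModelForCausalLM, AutoTokenizer
from ipex_llm import optimize_model

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

# Assumes oneCCL-based distributed support is installed; the launcher
# (e.g. mpirun) starts one process per GPU with LOCAL_RANK/WORLD_SIZE set.
deepspeed.init_distributed()

# Load on CPU first, then let DeepSpeed AutoTP shard the weights.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, low_cpu_mem_usage=True)
model = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": world_size},  # one shard per GPU
    dtype=torch.float16,
    replace_with_kernel_inject=False)         # kernel injection is CUDA-only

# Apply ipex-llm low-bit optimization to this rank's shard, then move it
# to its Intel GPU ("sym_int4" here is just an illustrative choice).
model = optimize_model(model.module.to("cpu"), low_bit="sym_int4")
model = model.to(f"xpu:{local_rank}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def run_generate(prompt: str, max_new_tokens: int) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(f"xpu:{local_rank}")
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)

if local_rank == 0:
    # Rank 0 owns the HTTP endpoint and broadcasts each prompt so that
    # all tensor-parallel ranks enter generate() together.
    import uvicorn
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class GenerateRequest(BaseModel):  # hypothetical request schema
        prompt: str
        max_new_tokens: int = 64

    @app.post("/generate")
    def generate(req: GenerateRequest):
        dist.broadcast_object_list([req.prompt, req.max_new_tokens], src=0)
        return {"text": run_generate(req.prompt, req.max_new_tokens)}

    uvicorn.run(app, host="0.0.0.0", port=8000)
else:
    # Worker ranks block until rank 0 hands them a prompt.
    while True:
        payload = [None, None]
        dist.broadcast_object_list(payload, src=0)
        run_generate(payload[0], payload[1])
```

The broadcast step matters because tensor parallelism is collective: every rank must enter `generate()` on the same prompt at the same time, so rank 0 hands each request to the worker ranks before generating.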