intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

provide support for model serving using FastAPI deepspeed+ipex-llm #10690

Open nazneenn opened 7 months ago

nazneenn commented 7 months ago

Hi, could you please provide a guide on using the DeepSpeed approach with multiple Intel Flex 140 GPUs to run model inference behind a FastAPI and uvicorn setup? Model id: 'meta-llama/Llama-2-7b-chat-hf'. Thanks
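
For context, the single-GPU version of what is being asked for might look like the sketch below: load the model with ipex-llm low-bit optimization, move it to the XPU, and expose generation through a FastAPI route served by uvicorn. This is only an illustrative sketch, not an official guide; `load_in_4bit` and the `/generate` route are assumptions made here, and the multi-GPU DeepSpeed part is what the issue is actually requesting.

```python
# Minimal sketch (not from the thread): serving 'meta-llama/Llama-2-7b-chat-hf'
# on a single Intel GPU with ipex-llm, FastAPI and uvicorn.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the 'xpu' device)
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"

# Load with low-bit optimization and move the model to the Intel GPU.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, load_in_4bit=True,
                                             trust_remote_code=True)
model = model.to("xpu")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

app = FastAPI()

class Prompt(BaseModel):
    prompt: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(req: Prompt):
    # Tokenize on the same device as the model, generate, and decode.
    inputs = tokenizer(req.prompt, return_tensors="pt").to("xpu")
    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"text": tokenizer.decode(output[0], skip_special_tokens=True)}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```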

glorysdj commented 7 months ago

Hi @nazneenn, we are developing a proof of concept for multi-GPU FastAPI serving and will keep you updated.

digitalscream commented 7 months ago

Watching this one - I'll be aiming to run Mixtral 8x7b AWQ on a pair of Arc A770s (I'll be buying the second GPU as soon as I know it's supported).

glorysdj commented 7 months ago

Hi @nazneenn @digitalscream, FastAPI serving with multiple GPUs is now supported in ipex-llm. Please refer to this example: https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Deepspeed-AutoTP-FastAPI
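
The linked Deepspeed-AutoTP-FastAPI example is the authoritative reference. As a rough sketch of the underlying pattern: one process is launched per GPU (e.g. with `mpirun -np 2 python serve.py`), DeepSpeed AutoTP shards the model's weights across the ranks, ipex-llm applies its low-bit optimization to each shard, and each shard is placed on its own XPU before serving. The snippet below only shows that loading/sharding step plus a plain generate call; the `LOCAL_RANK`/`WORLD_SIZE` variables, `low_bit="sym_int4"`, and the DeepSpeed arguments are assumptions made here, and the Intel-specific accelerator/oneCCL setup and the FastAPI request coordination across ranks from the real example are omitted.

```python
# Rough sketch of the DeepSpeed AutoTP + ipex-llm multi-GPU loading pattern;
# run one process per GPU, e.g.:  mpirun -np 2 python serve.py
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from ipex_llm import optimize_model

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"
local_rank = int(os.environ.get("LOCAL_RANK", "0"))   # set by the launcher (assumption)
world_size = int(os.environ.get("WORLD_SIZE", "1"))   # number of GPU processes (assumption)

# Load the full model on CPU, then let DeepSpeed AutoTP shard its weights
# across the participating ranks (tensor parallelism).
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, low_cpu_mem_usage=True)
model = deepspeed.init_inference(model,
                                 tensor_parallel={"tp_size": world_size},
                                 dtype=torch.float16,
                                 replace_with_kernel_inject=False)

# Apply ipex-llm low-bit optimization to this rank's shard and move it to
# the rank's Intel GPU.
model = optimize_model(model.module.to("cpu"), low_bit="sym_int4")
model = model.to(f"xpu:{local_rank}")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
inputs = tokenizer("What is AI?", return_tensors="pt").to(f"xpu:{local_rank}")
with torch.inference_mode():
    # Every rank executes generate(); DeepSpeed handles the cross-GPU
    # communication for the sharded layers.
    output = model.generate(**inputs, max_new_tokens=32)
if local_rank == 0:
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Wrapping this in FastAPI follows the same shape as the earlier single-GPU sketch; the linked example additionally broadcasts each incoming request so that all ranks join every generate call.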