fx-hit opened this issue 4 months ago
Hi, is it possible to run this with Ollama and host the LLM locally?
I found that ComfyUI_omost shows a way to accelerate inference with TGI (text generation inference): https://github.com/huchenlei/ComfyUI_omost?tab=readme-ov-file#accelerating-llm
Based on practical tests, deploying omost-llama-3-8b on an A100 with torch==2.3.0+cu118, vllm==0.5.0.post1+cu118, and xformers==0.0.26.post1+cu118 works well. If you want to speed up the process, you can refer to this setup.
vLLM quickstart: https://docs.vllm.ai/en/stable/getting_started/quickstart.html
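A minimal sketch of what that setup looks like with vLLM's offline API, assuming the Hugging Face repo id `lllyasviel/omost-llama-3-8b` and a placeholder system/user prompt (Omost's actual system prompt differs; swap in the one from the repo):

```python
# Sketch: running omost-llama-3-8b with vLLM instead of the default transformers pipeline.
# Assumptions: vllm==0.5.0.post1, model id "lllyasviel/omost-llama-3-8b" (or a local path),
# and an example prompt standing in for Omost's real system prompt.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "lllyasviel/omost-llama-3-8b"  # assumed repo id / local path

tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL, dtype="bfloat16")  # loads the model onto the GPU (e.g. an A100)

# Build a chat-formatted prompt using the model's own chat template.
messages = [
    {"role": "system", "content": "You are a helpful assistant that composes images."},  # placeholder
    {"role": "user", "content": "generate an image of a cat on a sofa"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=4096)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```

Alternatively, vLLM's OpenAI-compatible server (`python -m vllm.entrypoints.openai.api_server --model <model>`) can host the model as an HTTP endpoint, which is closer to the Ollama-style local-hosting workflow asked about above.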
Good idea! Could you kindly share the code?