fx-hit opened this issue 4 months ago
Hi, is it possible to run this with Ollama and host the LLM locally?
I found that ComfyUI_omost shows a way to accelerate inference with TGI (text generation inference): https://github.com/huchenlei/ComfyUI_omost?tab=readme-ov-file#accelerating-llm
Based on practical tests, deploying omost-llama-3-8b on an A100 with torch==2.3.0+cu118, vllm==0.5.0.post1+cu118, and xformers==0.0.26.post1+cu118 works well. If you want to speed up the process, you can refer to this setup.
vLLM quickstart: https://docs.vllm.ai/en/stable/getting_started/quickstart.html
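A minimal sketch of what that setup looks like with vLLM's offline API, assuming the Hugging Face repo id `lllyasviel/omost-llama-3-8b` and a placeholder system/user prompt (Omost's actual system prompt differs; swap in the one from the repo):

```python
# Sketch: running omost-llama-3-8b with vLLM instead of the default transformers pipeline.
# Assumptions: vllm==0.5.0.post1, model id "lllyasviel/omost-llama-3-8b" (or a local path),
# and an example prompt standing in for Omost's real system prompt.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "lllyasviel/omost-llama-3-8b"  # assumed repo id / local path

tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL, dtype="bfloat16")  # loads the model onto the GPU (e.g. an A100)

# Build a chat-formatted prompt using the model's own chat template.
messages = [
    {"role": "system", "content": "You are a helpful assistant that composes images."},  # placeholder
    {"role": "user", "content": "generate an image of a cat on a sofa"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=4096)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```

Alternatively, vLLM's OpenAI-compatible server (`python -m vllm.entrypoints.openai.api_server --model <model>`) can host the model as an HTTP endpoint, which is closer to the Ollama-style local-hosting workflow asked about above.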
Good idea! Could you kindly share the code?