Closed: Strongorange closed this issue 8 months ago.
vLLM provides an OpenAI-compatible API server that you can deploy easily with Docker.
If you would really like to build your own API server to serve concurrent requests, you should be using AsyncLLMEngine, and I would suggest looking at the implementations in https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py and https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py to see how to do so.
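For reference, here is a minimal sketch of what such a custom server could look like, loosely modeled on the linked api_server.py. The model name, port, and route are placeholders for illustration, not vLLM defaults; adjust them to your setup.

```python
import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid

app = FastAPI()

# AsyncLLMEngine keeps many requests in flight on the same model
# (continuous batching), which is what provides the concurrency.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="facebook/opt-125m")  # placeholder model
)

@app.post("/generate")
async def generate(request: Request) -> JSONResponse:
    body = await request.json()
    prompt = body["prompt"]
    sampling_params = SamplingParams(
        temperature=body.get("temperature", 0.7),
        max_tokens=body.get("max_tokens", 256),
    )
    request_id = random_uuid()

    # engine.generate yields partial RequestOutputs as tokens are produced;
    # this non-streaming handler just keeps the last one.
    final_output = None
    async for output in engine.generate(prompt, sampling_params, request_id):
        final_output = output

    return JSONResponse({"text": final_output.outputs[0].text})

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

The key point is that every request handler calls into the same shared AsyncLLMEngine rather than blocking on a synchronous LLM object, so overlapping requests are scheduled together instead of erroring out.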
Closing as @ywang96's answer is correct
Should I run this script, openai/api_server.py, to get concurrency and parallel processing of hundreds of requests at a time?
Yes
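As a rough illustration, assuming the packaged server has already been started (the model name and port below are placeholders), you can drive it with many parallel requests from the OpenAI Python client; vLLM batches them on the server side:

```python
# Start the server first, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
# Then fire many requests at it concurrently.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(i: int) -> str:
    resp = await client.completions.create(
        model="facebook/opt-125m",  # must match the --model the server was started with
        prompt=f"Question #{i}: say hello.",
        max_tokens=32,
    )
    return resp.choices[0].text

async def main() -> None:
    # 100 requests in flight at once; the server handles the scheduling.
    answers = await asyncio.gather(*(ask(i) for i in range(100)))
    print(len(answers), "answers received")

asyncio.run(main())
```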
How can I implement it in a RAG system?
I would like to know as well, I am using LlamaIndex Vllm: https://docs.llamaindex.ai/en/stable/api_reference/llms/vllm/
Me too, did you find any solution?
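One possible approach, sketched under the assumption that a vLLM OpenAI-compatible server is already running: keep retrieval inside your RAG stack (LlamaIndex or otherwise) and send the augmented prompt to the server over the standard endpoint. The retrieve() helper, model name, and URL below are placeholders for illustration.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def retrieve(question: str) -> str:
    # Placeholder: replace with your vector-store lookup (LlamaIndex, etc.).
    return "vLLM's AsyncLLMEngine batches concurrent requests automatically."

def rag_answer(question: str) -> str:
    # Classic RAG: paste the retrieved context into the prompt,
    # then let the vLLM server generate the answer.
    context = retrieve(question)
    prompt = (
        "Use the context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    resp = client.completions.create(
        model="facebook/opt-125m",  # whatever model the server was started with
        prompt=prompt,
        max_tokens=128,
    )
    return resp.choices[0].text

print(rag_answer("How does vLLM handle concurrent requests?"))
```

Because the server exposes a standard OpenAI-style API, concurrency comes for free: multiple RAG queries can hit it in parallel and the engine batches them.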
OS: Ubuntu 20.04 (Google Colab)
GPU: Nvidia T4 15 GB, A100 40 GB (Google Colab)
I am testing the quantized OrionStarAI/Orion-14B-Chat-Int4 model in a FastAPI environment with vLLM.
When testing requests, there is no problem processing them one by one, but if another request is received before the answer is generated, the following error occurs.