Open llmwesee opened 1 year ago
Hi, sorry for the late response. I've been busy preparing for a conference.
vLLM is the best option for concurrency and can handle a load of about 64 queries, so we tend to set h2oGPT's concurrency to 64 when feeding an LLM served by vLLM on an A100.
If you want to do more than 64 concurrent requests, it's probably a good idea to use 2 GPUs (e.g. 2 × A100 40GB instead) and round-robin the LLMs inside h2oGPT.
There's no code for that, but it's easy to add for the API case. One would use model lock to have 2 vLLM endpoints as normal, but inside h2oGPT you could have `visible_models_to_model_choice` return a random value from 0 to the length of `visible_models1`.
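A minimal sketch of that idea, purely illustrative since the real `visible_models_to_model_choice` in h2oGPT has a different signature and more logic:

```python
import random

# Hypothetical sketch only: the real visible_models_to_model_choice in h2oGPT
# takes more arguments and does the actual mapping; this just shows the random pick.
def visible_models_to_model_choice(visible_models1):
    # Spread load: pick one of the vLLM-backed models at random instead of
    # always mapping to a fixed model index.
    return random.randint(0, len(visible_models1) - 1)
```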
For the UI, you'd set visible models to only one of the 13B vLLMs, but you'd change the loop `for chatboti, (chatbot1, model_state1) in enumerate(zip(chatbots, model_states1)):` inside `all_bot()` so that once it reaches the condition `if visible_list[chatboti]:`, instead of going by `visible_list` you'd just randomly choose which `chatboti` to use.
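Roughly like the following hypothetical fragment (the real `all_bot()` loop in h2oGPT carries more state); `chatbots` and `model_states1` stand in for the values the caller already has:

```python
import random

# Placeholders standing in for the lists the real all_bot() receives.
chatbots = ["chatbot_vllm_a", "chatbot_vllm_b"]
model_states1 = [{"inference_server": "vllm:host1:5000"},
                 {"inference_server": "vllm:host2:5000"}]

# Pick one chatbot at random per request instead of consulting visible_list.
chosen_chatboti = random.randrange(len(chatbots))

for chatboti, (chatbot1, model_state1) in enumerate(zip(chatbots, model_states1)):
    if chatboti == chosen_chatboti:   # was: if visible_list[chatboti]:
        pass  # run generation against this chatbot's vLLM endpoint
```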
Thanks for your response. I successfully integrated the vLLM engine with the h2oGPT ecosystem:
```
INFO 11-09 10:14:03 llm_engine.py:72] Initializing an LLM engine with config: model='meta-llama/Llama-2-13b-chat-hf', tokenizer='hf-internal-testing/llama-tokenizer', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=1234)
INFO 11-09 10:14:25 llm_engine.py:207] # GPU blocks: 3802, # CPU blocks: 327
```
I'm using it for inference through Llama-2-13b-chat-hf with user_data inside the h2oGPT ecosystem:
```
Using Model meta-llama/llama-2-13b-chat-hf
load INSTRUCTOR_Transformer
max_seq_length 512
Starting get_model: meta-llama/Llama-2-13b-chat-hf vllm:0.0.0.0:5000
```
Now I want to know how this vLLM engine handles the maximum number of concurrent requests. Should I run the following command to serve 64 requests in parallel?
```bash
python generate.py --inference_server="vllm:0.0.0.0:5000" --base_model=meta-llama/Llama-2-13b-chat-hf --score_model=None --langchain_mode='UserData' --user_path=user_path --use_auth_token=True --max_seq_len=4096 --max_max_new_tokens=2048 --max_batch_size=16 --concurrency_count=64
```
Or is there some other method? If so, please guide me on implementing it.
Yes, `--concurrency_count=64` is the right option: it ensures that up to 64 requests hitting h2oGPT run concurrently without queuing. vLLM itself has no limit, so h2oGPT will push up to 64 requests to it.
As I mentioned, 64 for a single vLLM 13B might be a bit tough on it. We've seen vLLM have connection errors sometimes under load, so it would be wise to use 2+ 13Bs if possible and round-robin them in the way I described.
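If you want to verify that concurrency end to end, here's a rough load-test sketch. The endpoint name `/submit_nochat_api`, the `str(dict(...))` payload format, and the local host address are assumptions based on h2oGPT's client examples; adjust them to your setup.

```python
# Rough load-test sketch: fire N concurrent requests at a running h2oGPT
# server to exercise --concurrency_count. Endpoint name, payload keys, and
# host are assumptions based on h2oGPT's client examples; adjust as needed.
import ast
from concurrent.futures import ThreadPoolExecutor

from gradio_client import Client

HOST = "http://localhost:7860"   # assumed h2oGPT address
N_CONCURRENT = 64

def ask(prompt):
    client = Client(HOST)
    payload = str(dict(instruction_nochat=prompt, max_new_tokens=128))
    res = client.predict(payload, api_name="/submit_nochat_api")
    return ast.literal_eval(res)["response"]

if __name__ == "__main__":
    prompts = [f"Question {i}: summarize vLLM in one sentence."
               for i in range(N_CONCURRENT)]
    with ThreadPoolExecutor(max_workers=N_CONCURRENT) as pool:
        for answer in pool.map(ask, prompts):
            print(answer[:80])
```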
@pseudotensor @llmwesee
I'm the maintainer of LiteLLM. We implemented a request queue for making LLM API calls (to any LLM). Our queue can handle 100+ requests/second.
I believe this makes it easier to manage multiple vLLM deployments + request queuing. I'd love your feedback if it doesn't.
Here's a quick start (docs: https://docs.litellm.ai/docs/routing#queuing-beta).

Set your Redis credentials:

```bash
REDIS_HOST="my-redis-endpoint"
REDIS_PORT="my-redis-port"
REDIS_PASSWORD="my-redis-password" # [OPTIONAL] if self-hosted
REDIS_USERNAME="default" # [OPTIONAL] if self-hosted
```

Then start the proxy with the queue enabled:

```bash
$ litellm --config /path/to/config.yaml --use_queue
```
Here's an example config.yaml for gpt-3.5-turbo (this will load balance between OpenAI + Azure endpoints):
```yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/chatgpt-v-2 # actual model name
      api_key:
      api_version: 2023-07-01-preview
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
```
Then test the queue with:

```bash
$ litellm --test_async --num_requests 100
```
The queue exposes two endpoints:
- `/queue/request` - Queues a /chat/completions request and returns a job id.
- `/queue/response/{id}` - Returns the status of a job; if completed, returns the response as well. Possible statuses are `queued` and `finished`.
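For reference, a minimal sketch of driving those two endpoints from Python, assuming the LiteLLM proxy is on localhost:8000 and the request body mirrors a standard /chat/completions payload (see the LiteLLM docs above for the exact schema and field names):

```python
# Minimal sketch of the queue protocol described above. Host, port, and the
# response field names ("id", "status") are assumptions; check the LiteLLM
# routing docs for the exact schema.
import time

import requests

BASE = "http://localhost:8000"

# 1. Queue a /chat/completions-style request; the proxy returns a job id.
job = requests.post(
    f"{BASE}/queue/request",
    json={
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Hello from the queue"}],
    },
).json()
job_id = job.get("id")  # assumed field name

# 2. Poll until the job status moves from "queued" to "finished".
while True:
    status = requests.get(f"{BASE}/queue/response/{job_id}").json()
    if status.get("status") == "finished":
        print(status)
        break
    time.sleep(0.5)
```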
I would like to inquire about the concurrent-request capacity of an NVIDIA A100 80GB GPU when running the Llama-2-13b model with full document capabilities within the h2oGPT ecosystem for production purposes. Specifically, I am interested in the optimal level of concurrency attainable when operating on a local server, with a primary focus on achieving low latency and maximising throughput. Also, how can we divide the requests across multiple A100 GPU instances, and what should `max_batch_size` be? How much GPU power is required to serve around 100 concurrent requests with the Llama-2-13b model?