Open llmwesee opened 1 year ago
Hi, sorry for the late response. I've been busy preparing for a conference.
vLLM is the best option for concurrency and can handle a load of about 64 queries, so we tend to set h2oGPT's concurrency to 64 when feeding an LLM served by vLLM on an A100.
If you want to do more than 64 concurrent requests, it's probably a good idea to use 2 GPUs (e.g. 2 × A100 40GB instead) and round-robin the LLMs inside h2oGPT.
There's no code for that, but it's easy to add for the API case. One would use model lock to have 2 vLLM endpoints as normal, but inside h2oGPT you could have `visible_models_to_model_choice` return a random value from 0 to the length of `visible_models1`.
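A minimal sketch of that idea, purely illustrative since the real `visible_models_to_model_choice` in h2oGPT has a different signature and more logic:

```python
import random

# Hypothetical sketch only: the real visible_models_to_model_choice in h2oGPT
# takes more arguments and does the actual mapping; this just shows the random pick.
def visible_models_to_model_choice(visible_models1):
    # Spread load: pick one of the vLLM-backed models at random instead of
    # always mapping to a fixed model index.
    return random.randint(0, len(visible_models1) - 1)
```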
For the UI, you'd set visible models to only one of the 13B vLLMs, but you'd change the loop `for chatboti, (chatbot1, model_state1) in enumerate(zip(chatbots, model_states1)):` inside `all_bot()` so that once it reaches the condition `if visible_list[chatboti]:`, instead of going by `visible_list` you'd just randomly choose which `chatboti` to use.
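Roughly like the following hypothetical fragment (the real `all_bot()` loop in h2oGPT carries more state); `chatbots` and `model_states1` stand in for the values the caller already has:

```python
import random

# Placeholders standing in for the lists the real all_bot() receives.
chatbots = ["chatbot_vllm_a", "chatbot_vllm_b"]
model_states1 = [{"inference_server": "vllm:host1:5000"},
                 {"inference_server": "vllm:host2:5000"}]

# Pick one chatbot at random per request instead of consulting visible_list.
chosen_chatboti = random.randrange(len(chatbots))

for chatboti, (chatbot1, model_state1) in enumerate(zip(chatbots, model_states1)):
    if chatboti == chosen_chatboti:   # was: if visible_list[chatboti]:
        pass  # run generation against this chatbot's vLLM endpoint
```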
Thanks for your response. I successfully integrated the vLLM engine with the h2oGPT ecosystem:
```
INFO 11-09 10:14:03 llm_engine.py:72] Initializing an LLM engine with config: model='meta-llama/Llama-2-13b-chat-hf', tokenizer='hf-internal-testing/llama-tokenizer', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=1234)
INFO 11-09 10:14:25 llm_engine.py:207] # GPU blocks: 3802, # CPU blocks: 327
```
I'm using it for inference through Llama-2-13b-chat-hf with user_data inside the h2oGPT ecosystem:
```
Using Model meta-llama/llama-2-13b-chat-hf
load INSTRUCTOR_Transformer
max_seq_length 512
Starting get_model: meta-llama/Llama-2-13b-chat-hf vllm:0.0.0.0:5000
```
Now I want to know how this vLLM engine handles the maximum number of concurrent requests. Should I run the following command to serve 64 requests in parallel?
```bash
python generate.py --inference_server="vllm:0.0.0.0:5000" --base_model=meta-llama/Llama-2-13b-chat-hf --score_model=None --langchain_mode='UserData' --user_path=user_path --use_auth_token=True --max_seq_len=4096 --max_max_new_tokens=2048 --max_batch_size=16 --concurrency_count=64
```
Or is there some other method? If so, please guide me on implementing it.
Yes, `--concurrency_count=64` is the right option: it ensures that up to 64 requests hitting h2oGPT run concurrently without queuing. vLLM itself has no limit, so h2oGPT will push up to 64 requests to it.
As I mentioned, 64 for a single vLLM 13B might be a bit tough on it. We've seen vLLM have connection errors sometimes under load, so it would be wise to use 2+ 13Bs if possible and round-robin them in the way I described.
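If you want to verify that concurrency end to end, here's a rough load-test sketch. The endpoint name `/submit_nochat_api`, the `str(dict(...))` payload format, and the local host address are assumptions based on h2oGPT's client examples; adjust them to your setup.

```python
# Rough load-test sketch: fire N concurrent requests at a running h2oGPT
# server to exercise --concurrency_count. Endpoint name, payload keys, and
# host are assumptions based on h2oGPT's client examples; adjust as needed.
import ast
from concurrent.futures import ThreadPoolExecutor

from gradio_client import Client

HOST = "http://localhost:7860"   # assumed h2oGPT address
N_CONCURRENT = 64

def ask(prompt):
    client = Client(HOST)
    payload = str(dict(instruction_nochat=prompt, max_new_tokens=128))
    res = client.predict(payload, api_name="/submit_nochat_api")
    return ast.literal_eval(res)["response"]

if __name__ == "__main__":
    prompts = [f"Question {i}: summarize vLLM in one sentence."
               for i in range(N_CONCURRENT)]
    with ThreadPoolExecutor(max_workers=N_CONCURRENT) as pool:
        for answer in pool.map(ask, prompts):
            print(answer[:80])
```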
@pseudotensor @llmwesee
I'm the maintainer of LiteLLM. We implemented a request queue for making LLM API calls (to any LLM). Our queue can handle 100+ requests/second.
I believe this makes it easier to manage multiple vLLM deployments + request queuing. I'd love your feedback if it doesn't.
Here's a quick start (docs: https://docs.litellm.ai/docs/routing#queuing-beta).

Set your Redis credentials:

```bash
REDIS_HOST="my-redis-endpoint"
REDIS_PORT="my-redis-port"
REDIS_PASSWORD="my-redis-password" # [OPTIONAL] if self-hosted
REDIS_USERNAME="default" # [OPTIONAL] if self-hosted
```

Then start the proxy with the queue enabled:

```bash
$ litellm --config /path/to/config.yaml --use_queue
```
Here's an example config.yaml for gpt-3.5-turbo (this will load balance between OpenAI + Azure endpoints):
```yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/chatgpt-v-2 # actual model name
      api_key:
      api_version: 2023-07-01-preview
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
```
Then test the queue with:

```bash
$ litellm --test_async --num_requests 100
```
The queue exposes two endpoints:
- `/queue/request` - Queues a /chat/completions request and returns a job id.
- `/queue/response/{id}` - Returns the status of a job; if completed, returns the response as well. Possible statuses are `queued` and `finished`.
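For reference, a minimal sketch of driving those two endpoints from Python, assuming the LiteLLM proxy is on localhost:8000 and the request body mirrors a standard /chat/completions payload (see the LiteLLM docs above for the exact schema and field names):

```python
# Minimal sketch of the queue protocol described above. Host, port, and the
# response field names ("id", "status") are assumptions; check the LiteLLM
# routing docs for the exact schema.
import time

import requests

BASE = "http://localhost:8000"

# 1. Queue a /chat/completions-style request; the proxy returns a job id.
job = requests.post(
    f"{BASE}/queue/request",
    json={
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Hello from the queue"}],
    },
).json()
job_id = job.get("id")  # assumed field name

# 2. Poll until the job status moves from "queued" to "finished".
while True:
    status = requests.get(f"{BASE}/queue/response/{job_id}").json()
    if status.get("status") == "finished":
        print(status)
        break
    time.sleep(0.5)
```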
I would like to inquire about the concurrent-request capacity of an NVIDIA A100 80GB GPU when running the Llama-2-13b model with full document capabilities within the h2oGPT ecosystem for production purposes. Specifically, I am interested in the optimal level of concurrency attainable when operating on a local server, with a primary focus on achieving low latency and maximising throughput. Also, how can we divide the requests across multiple A100 GPU instances, and what should `max_batch_size` be? How much GPU power is required to serve around 100 concurrent requests with the Llama-2-13b model?