vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

text_generation_router::infer: router/src/infer.rs:130: no permits available #4936

Open Ling-CF opened 5 months ago

Ling-CF commented 5 months ago

Your current environment

python benchmark_serving.py --backend tgi --model /model/Mixtral_email_sft --dataset /usr/src/dataset/ShareGPT_V3_unfiltered_cleaned_split.json --port 8080 --num-prompts 256 --endpoint /generate_stream --request-rate 32 --trust-remote-code

🐛 Describe the bug

[Bug]: 2024-05-21T06:25:30.209697Z ERROR generate_stream{parameters=GenerateParameters { best_of: Some(1), temperature: Some(0.01), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: Some(0.99), typical_p: None, do_sample: true, max_new_tokens: Some(393), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream: text_generation_router::infer: router/src/infer.rs:130: no permits available

Hello, when I ran benchmark_serving.py against the TGI backend, I got the error above. With --num-prompts set to 256 and --request-rate set to 32, I ended up with fewer than 256 successful requests (--max-concurrent-requests was set to 200).

Can anyone help me? Thank you
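
For anyone else hitting this: despite being filed here, the message is raised by Text Generation Inference's router, not by vLLM. The router in router/src/infer.rs admits requests through a tokio Semaphore sized by --max-concurrent-requests, and it tries to take a permit without waiting, so a request that arrives while all permits are held is rejected immediately rather than queued; "no permits available" is tokio's TryAcquireError message. At --request-rate 32 with long Mixtral generations, well over 200 requests can be in flight at once, and the overflow is dropped, which is why fewer than 256 requests succeed. Below is a minimal sketch of that admission pattern, not TGI's actual code:

```rust
// Requires tokio with the "sync", "macros", and "rt-multi-thread" features.
use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // One permit per allowed in-flight request, like --max-concurrent-requests.
    let limit = Arc::new(Semaphore::new(2));
    let mut held = Vec::new();

    for i in 0..3 {
        // try_acquire_owned fails fast instead of waiting when no permit is free.
        match limit.clone().try_acquire_owned() {
            Ok(permit) => {
                println!("request {i}: admitted");
                // The permit is released when dropped, i.e. when the "request" finishes.
                held.push(permit);
            }
            // tokio's TryAcquireError prints as "no permits available".
            Err(e) => println!("request {i}: rejected: {e}"),
        }
    }
}
```

Because permits are released as requests complete, throughput recovers on its own once the offered load falls back below the limit; the rejections only happen while more than --max-concurrent-requests requests are in flight simultaneously.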

fchavat commented 2 months ago

Hi. Were you ever able to find a solution for this error? I'm facing the same problem when serving an LLM through Hugging Face TGI.

Thank you!
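
Assuming the semaphore behavior sketched above, the usual workarounds are to raise --max-concurrent-requests on the server, lower --request-rate in the benchmark, or have the client back off and retry, since TGI maps the overloaded error to HTTP 429 Too Many Requests. A rough client-side retry sketch; the endpoint path, port, and payload here are illustrative assumptions, not values taken from this issue:

```rust
// Cargo deps assumed: tokio ("full"), reqwest (with the "json" feature), serde_json.
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    // Payload shape for TGI's /generate endpoint; the values are placeholders.
    let body = serde_json::json!({
        "inputs": "Hello",
        "parameters": { "max_new_tokens": 32 }
    });

    for attempt in 0u32..5 {
        let resp = client
            .post("http://localhost:8080/generate") // host and port are assumptions
            .json(&body)
            .send()
            .await?;
        // TGI answers 429 Too Many Requests when the router has no permits left.
        if resp.status() != reqwest::StatusCode::TOO_MANY_REQUESTS {
            println!("done with status {}", resp.status());
            return Ok(());
        }
        // Exponential backoff: 100 ms, 200 ms, 400 ms, ...
        tokio::time::sleep(Duration::from_millis(100 * 2u64.pow(attempt))).await;
    }
    eprintln!("gave up after repeated 429s");
    Ok(())
}
```

Retrying only papers over the symptom; if the benchmark must sustain 32 requests/s, the server-side concurrency limit has to be high enough for the resulting number of in-flight requests.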