vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Concurrent timeout #6009

Open luhairong11 opened 4 months ago

luhairong11 commented 4 months ago

Your current environment

Server start command: python -m vllm.entrypoints.openai.api_server --model /data/Qwen1.5-1.8B-Chat-GPTQ-Int4 --served-model-name Qwen1.5-1.8B-Chat-GPTQ-Int4 --quantization gptq --dtype float16 --gpu-memory-utilization 0.2 --tensor-parallel-size 1 --trust-remote-code --max-model-len 4096 --served-model-name qwen1.5-1.8b

🐛 Describe the bug

When testing with 50 concurrent requests, 4 of them fail with a connection timeout error (see the attached screenshot). How should this issue be resolved?
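
For reference, a minimal load-test sketch that sends 50 concurrent chat requests to the OpenAI-compatible server started above. It is not the reporter's original test client; it assumes the server listens on the default http://localhost:8000/v1 and uses the served model name qwen1.5-1.8b, and it raises the client-side timeout so slow-but-successful responses are not misreported as failures.

```python
import asyncio

import aiohttp

# Assumed default host/port for the OpenAI-compatible API server.
URL = "http://localhost:8000/v1/chat/completions"
PAYLOAD = {
    "model": "qwen1.5-1.8b",  # matches --served-model-name in the start command
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 64,
}


async def one_request(session: aiohttp.ClientSession, idx: int) -> str:
    try:
        async with session.post(URL, json=PAYLOAD) as resp:
            await resp.json()
            return f"request {idx}: HTTP {resp.status}"
    except asyncio.TimeoutError:
        return f"request {idx}: client-side timeout"


async def main() -> None:
    # A generous total timeout; many HTTP clients default to a much shorter
    # value, which can surface as connection/read timeouts under load.
    timeout = aiohttp.ClientTimeout(total=300)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        results = await asyncio.gather(
            *(one_request(session, i) for i in range(50))
        )
    for line in results:
        print(line)


if __name__ == "__main__":
    asyncio.run(main())
```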

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!