vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: Qwen2-72B-Instruct-GPTQ-Int4 OpenAI Server Request Problem #5407

Open syngokhan opened 3 months ago

syngokhan commented 3 months ago

Hello, I wish you good work.

When I serve the Qwen2-72B-Instruct-GPTQ-Int4 model with vLLM and send multiple requests, the server seems to collect all of the requests first and only starts responding after all of them have arrived. With the Qwen2-7B-Instruct model, the same requests are picked up and answered individually as they come in.

I do not see this behavior with other models, but with Qwen2-72B-Instruct and other quantized models the requests are only answered once every request has been received. I would be glad if you could help.

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model /opt/GPT/MODEL/Qwen2-72B-Instruct-GPTQ-Int4 --host 10.12.112.160 --port 9001 --max-model-len 8192 --tensor-parallel-size 1
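To make the behavior easier to demonstrate, here is a minimal sketch (not part of the original report) that fires several concurrent requests at the OpenAI-compatible endpoint and prints when each one completes. It assumes the server started with the command above is reachable at `http://10.12.112.160:9001` and that the served model name matches the model path.

```python
# Sketch: send concurrent requests and log completion times, to see whether
# responses come back interleaved or only after all requests are queued.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://10.12.112.160:9001/v1/chat/completions"  # assumed address
MODEL = "/opt/GPT/MODEL/Qwen2-72B-Instruct-GPTQ-Int4"       # assumed served name

def send_request(i: int) -> None:
    start = time.time()
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": f"Request {i}: say hello."}],
        "max_tokens": 64,
    }
    resp = requests.post(BASE_URL, json=payload, timeout=600)
    print(f"request {i} finished in {time.time() - start:.1f}s "
          f"(status {resp.status_code})")

# Fire 8 requests at once; with continuous batching they should overlap
# rather than all finishing only after the last request is accepted.
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(send_request, range(8))
```

Timestamps that all cluster together after a long wait would match the reported behavior; staggered completion times would indicate normal interleaved scheduling.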

robertgshaw2-neuralmagic commented 3 months ago

We are happy to help, but you need to clarify what you are seeing, because I cannot follow the description above well.

It is most useful if you can show: