vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: Qwen2-72B-Instruct-GPTQ-Int4 OpenAI Server Request Problem #5407

Open syngokhan opened 5 months ago

syngokhan commented 5 months ago

Hello, I wish you good work.

When I serve the Qwen2-72B-Instruct-GPTQ-Int4 model with vLLM and send multiple requests at once, the server seems to collect all of the requests first and only then starts responding. With the Qwen2-7B-Instruct model, by contrast, responses come back individually as each request is processed.

I do not see this behavior with other models; only Qwen2-72B-Instruct and other quantized models appear to wait for all requests before responding. I would be glad if you could help.

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model /opt/GPT/MODEL/Qwen2-72B-Instruct-GPTQ-Int4 --host 10.12.112.160 --port 9001 --max-model-len 8192 --tensor-parallel-size 1
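For reference, a minimal client sketch (the prompt text, token limit, and concurrency level are illustrative; the host, port, and model path are taken from the launch command above) that sends a few streaming requests concurrently to the OpenAI-compatible endpoint, which makes it easy to see whether each request starts receiving tokens on its own or only after all requests have been queued:

```python
# Sketch: fire several concurrent streaming requests at the vLLM
# OpenAI-compatible server and report when each one starts streaming.
# Prompt text and thread count are illustrative assumptions.
import threading
import time

from openai import OpenAI

client = OpenAI(
    base_url="http://10.12.112.160:9001/v1",  # host/port from the launch command
    api_key="EMPTY",                          # vLLM does not check the key by default
)
MODEL = "/opt/GPT/MODEL/Qwen2-72B-Instruct-GPTQ-Int4"

def ask(i: int) -> None:
    start = time.time()
    stream = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"Request {i}: explain GPTQ in one sentence."}],
        stream=True,
        max_tokens=64,
    )
    first = None
    for _chunk in stream:
        if first is None:
            first = time.time() - start
            print(f"request {i}: first token after {first:.2f}s")
    print(f"request {i}: finished after {time.time() - start:.2f}s")

# Send 4 requests at roughly the same time.
threads = [threading.Thread(target=ask, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```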

robertgshaw2-neuralmagic commented 5 months ago

We are happy to help, but you need to clarify what you are seeing, because I cannot follow the above well.

It would be most useful if you could show:

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!