Open syngokhan opened 5 months ago
We are happy to help, but you need to clarify what you are seeing, because I cannot follow the above well.
It is most useful if you can show:
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Hello, and I wish you well with your work.
When I serve the Qwen2-72B-Instruct-GPTQ-Int4 model with vLLM and send multiple requests at once, the server first collects all of the requests and only then responds to them together. With your Qwen2-7B-Instruct model, by contrast, the requests are handled individually as they come in.
I do not have this problem with other models, but Qwen2-72B-Instruct and the other quantized variants receive all the requests first and only then respond. I would be glad if you could help.
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model /opt/GPT/MODEL/Qwen2-72B-Instruct-GPTQ-Int4 --host 10.12.112.160 --port 9001 --max-model-len 8192 --tensor-parallel-size 1
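To make the behavior easier to diagnose, here is a minimal sketch for reproducing it: it fires several chat-completion requests at the server above concurrently and prints each request's wall-clock latency. If the server is really serializing the batch, all latencies should cluster near the slowest request instead of spreading out. The host, port, and model path are taken from the serve command above; adjust them for your setup. This uses only the Python standard library and assumes the standard OpenAI-compatible `/v1/chat/completions` endpoint that vLLM exposes.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Endpoint and model path taken from the serve command above; adjust as needed.
API_URL = "http://10.12.112.160:9001/v1/chat/completions"
MODEL = "/opt/GPT/MODEL/Qwen2-72B-Instruct-GPTQ-Int4"


def build_payload(prompt: str) -> dict:
    """Build an OpenAI-compatible chat-completion request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }


def timed_request(prompt: str) -> float:
    """Send one request and return its wall-clock latency in seconds."""
    data = json.dumps(build_payload(prompt)).encode()
    req = urllib.request.Request(
        API_URL, data=data, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start


# Example usage (uncomment to actually hit the server):
# prompts = [f"Say the number {i}" for i in range(4)]
# with ThreadPoolExecutor(max_workers=4) as pool:
#     for latency in pool.map(timed_request, prompts):
#         print(f"{latency:.2f}s")
```

Sharing the per-request latencies from a run like this (for both the 72B GPTQ model and the 7B model) would make it much clearer whether requests are being batched and released together.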