Closed: xjw00654 closed this issue 3 months ago.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
Hello! Have you solved this problem? I hit the same issue when testing with multiple concurrent requests: running_req is always 1, so it seems there is no concurrency.
Any updates on this issue?
I start the server with the following command:
CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --model-path LLMs/Qwen-14B-Chat --port 30000 --trust-remote-code --stream-interval 1 --enable-flashinfer --schedule-conservativeness 50
and use the following code to test its concurrency. It only generates ~10 tokens/s, whereas vLLM reaches ~30 tokens/s, so it looks as if the API call path does not support batch inferencing: the logs always show running_req = 1. My question is: do I need to implement batching myself when calling the API, or is something wrong with my setup? BTW, I also tried the batching example from the README, and it works fine, running even faster than I expected! Thank you so much in advance.
SCRIPTS
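(My test is roughly of the following shape; this is a simplified sketch rather than my exact script, and the /generate endpoint with its text / sampling_params fields reflects my understanding of the native HTTP API, so the names may need adjusting.)

```python
# Rough sketch of a concurrency test: CONCURRENCY threads each send an
# independent request to the server, and an approximate aggregate tokens/s
# is reported. Endpoint and field names are assumptions, not verified.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:30000/generate"  # matches --port 30000 above
PROMPT = "Write a quicksort function in Python."
CONCURRENCY = 8
MAX_NEW_TOKENS = 256


def one_request(_):
    # Send a single non-streaming generation request.
    resp = requests.post(
        URL,
        json={
            "text": PROMPT,
            "sampling_params": {
                "max_new_tokens": MAX_NEW_TOKENS,
                "temperature": 0.0,
            },
        },
        timeout=600,
    )
    resp.raise_for_status()
    # Assumes the response JSON contains the generated text under "text".
    return resp.json()["text"]


start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    outputs = list(pool.map(one_request, range(CONCURRENCY)))
elapsed = time.time() - start

# Rough upper-bound estimate: assumes every request generated ~MAX_NEW_TOKENS tokens.
print(
    f"{CONCURRENCY} requests in {elapsed:.1f}s "
    f"(~{CONCURRENCY * MAX_NEW_TOKENS / elapsed:.1f} tokens/s upper bound)"
)
```

With this kind of client I would expect running_req in the server logs to climb toward CONCURRENCY while the requests overlap, but it stays at 1.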