Open chqy99 opened 3 months ago
Obviously this varies based on your GPU model and your load tester.
Your load test cannot have a static schedule (e.g., all prefills first, then all decodes). Requests have to arrive at random times so that prefill can be scheduled concurrently with decode. A sketch of such a load test is shown below.
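As a rough illustration (the model name, dataset, and QPS value here are placeholders, and flag names may differ slightly across vLLM versions), a load test with randomized (Poisson) request arrivals could be run with the bundled serving benchmark:

```bash
# Hypothetical benchmark run: --request-rate makes requests arrive as a
# Poisson process at ~10 QPS instead of all at once, so prefill and decode
# work naturally overlap in the scheduler.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-2-13b-hf \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 500 \
    --request-rate 10
```

The key point is the request-rate setting: if all requests are submitted at the same instant, the schedule degenerates toward batched prefill followed by batched decode, and the benefit of chunked prefill is much smaller.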
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
How would you like to use vllm
I saw in https://github.com/vllm-project/vllm/issues/3130 that chunked prefill improves ITL/e2e latency by almost 2X at high QPS. But with vLLM 0.5.3 and two A100s, I only see roughly a 10~20% improvement. Please share more details about the test setup, such as the server-launch command, the client-request command, GPU settings, etc.
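For context, a chunked-prefill launch on two A100s might look roughly like the sketch below; the model name and token budget are placeholder assumptions, not the configuration used in #3130:

```bash
# Hypothetical server launch: chunked prefill with tensor parallelism
# across two A100s. --max-num-batched-tokens is the per-step token budget
# shared by decode tokens and prefill chunks; smaller values favor ITL,
# larger values favor throughput.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-hf \
    --tensor-parallel-size 2 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 512
```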