vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: How to test that chunked prefill improves ITL/e2e latency by almost 2X at high QPS? #7147

Open chqy99 opened 3 months ago

chqy99 commented 3 months ago

Your current environment

The performance of chunked-prefill.

How would you like to use vllm

I saw in https://github.com/vllm-project/vllm/issues/3130 that chunked prefill improves ITL/e2e latency by almost 2X at high QPS. But with vllm-0.5.3 on two A100s, I only see roughly a 10~20% improvement. Please share more details about the test, such as the server launch command, the client request command, the GPU configuration, etc. A rough outline of my current setup is below.
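For reference, this is roughly what I am running (a sketch only: the model name is a placeholder and the exact flag names may differ between vLLM versions, so check `--help` for your build):

```bash
# Server: two A100s via tensor parallelism, chunked prefill enabled.
MODEL=meta-llama/Llama-2-7b-hf   # placeholder; substitute the model under test
python -m vllm.entrypoints.openai.api_server \
    --model "$MODEL" \
    --tensor-parallel-size 2 \
    --enable-chunked-prefill

# Client: vLLM's serving benchmark, requests arriving at a fixed average rate (QPS).
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model "$MODEL" \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1000 \
    --request-rate 10
```

I compare two runs of the same benchmark, with and without `--enable-chunked-prefill`, at a few different `--request-rate` values.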

jon-chuang commented 3 months ago

Obviously this varies based on your GPU model and your load tester.

Your load test cannot use a static schedule, e.g. all prefills first and then all decodes. Requests have to arrive at random times so that the scheduler can run prefills concurrently with decodes, which is where chunked prefill helps.
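For what it's worth, vLLM's benchmarks/benchmark_serving.py should already give you this when you pass a finite `--request-rate`: as far as I recall, inter-arrival gaps are drawn from an exponential distribution, i.e. requests follow a Poisson process. A minimal sketch of the idea (not vLLM's actual code):

```python
import random


def poisson_arrivals(num_requests: int, request_rate: float, seed: int = 0) -> list[float]:
    """Arrival offsets in seconds with exponential inter-arrival gaps (a Poisson
    process): requests trickle in at random times instead of all at once, so new
    requests' prefills overlap with the decode steps of in-flight requests."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_requests):
        t += rng.expovariate(request_rate)  # mean gap = 1 / request_rate seconds
        times.append(t)
    return times


# e.g. 1000 requests at ~10 QPS spread over roughly 100 seconds
print(poisson_arrivals(1000, 10.0)[-1])
```

If you send everything at t=0 (`--request-rate inf`, the benchmark's default last I checked), you are back to something close to the static pattern described above, so a finite rate is the fairer comparison.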

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!