Open · cadedaniel opened this issue 6 months ago
What we want to profile:

- For the low-latency use case:
- For the high-throughput use case:
- Other cases that are important (perhaps we make separate tasks):
In terms of how to profile, use benchmark_latency plus torch profiling (or a CPU profiler of your choosing): https://github.com/vllm-project/vllm/blob/c3c2903e72c6e85a81ff6de8b879f4c82e8ad364/benchmarks/benchmark_latency.py#L178-L187
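For reference, a minimal sketch of what such a profiled run could look like, loosely mirroring the `--profile` path in `benchmark_latency.py`. The model name, batch size, and the `use_v2_block_manager` flag below are placeholder assumptions; adjust them to whatever configuration you actually want to profile.

```python
# Sketch: profile a latency-style generation run with torch.profiler.
# The engine arguments here (model, use_v2_block_manager) are assumptions,
# not a prescribed configuration.
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", use_v2_block_manager=True)
sampling_params = SamplingParams(max_tokens=128, ignore_eos=True)
prompts = ["Hello, my name is"] * 8  # small batch for the low-latency case

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./vllm_profile"),
) as prof:
    llm.generate(prompts, sampling_params)

# Block manager work happens on the CPU side of the scheduler,
# so sorting by self CPU time is what surfaces its overhead.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```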
@robertgshaw2-neuralmagic can you assign Alex?
Proposal to improve performance
We've recently rewritten the block management subsystem for better testability. We need to profile it under real load to confirm it is performant enough to replace block manager v1, and fix any issues we find.
We should do this once block manager v2 is feature complete (it is still missing a few items).
Known issue: `num_total_tokens` is O(N^2) instead of O(N) (see https://github.com/vllm-project/vllm/pull/4142#discussion_r1585245813).
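To illustrate the kind of fix this calls for, here is a toy sketch (not vLLM's actual block table classes) of how recomputing a token count on every query turns a per-token loop into O(N^2), and how a cached running count keeps the overall cost O(N):

```python
# Toy illustration only; class names and structure are made up for clarity.

class SlowBlockTable:
    def __init__(self) -> None:
        self._blocks: list[list[int]] = []

    def append_token(self, token_id: int, block_size: int = 16) -> None:
        if not self._blocks or len(self._blocks[-1]) == block_size:
            self._blocks.append([])
        self._blocks[-1].append(token_id)

    @property
    def num_total_tokens(self) -> int:
        # Walks every block on each call: O(number of blocks) per query,
        # hence O(N^2) total when queried once per appended token.
        return sum(len(block) for block in self._blocks)


class FastBlockTable(SlowBlockTable):
    def __init__(self) -> None:
        super().__init__()
        self._num_tokens = 0

    def append_token(self, token_id: int, block_size: int = 16) -> None:
        super().append_token(token_id, block_size)
        self._num_tokens += 1

    @property
    def num_total_tokens(self) -> int:
        # Cached running count: O(1) per query, O(N) over the sequence.
        return self._num_tokens
```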