Open · cadedaniel opened this issue 6 months ago
What we want to profile:

- For the low-latency use case:
- For the high-throughput use case:
- Other cases that are important (perhaps we make separate tasks):
In terms of how to profile, use benchmark_latency plus torch profiling (or a CPU profiler of your choosing): https://github.com/vllm-project/vllm/blob/c3c2903e72c6e85a81ff6de8b879f4c82e8ad364/benchmarks/benchmark_latency.py#L178-L187
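For reference, a minimal sketch of what such a profiled run could look like, loosely mirroring the `--profile` path in `benchmark_latency.py`. The model name, batch size, and the `use_v2_block_manager` flag below are placeholder assumptions; adjust them to whatever configuration you actually want to profile.

```python
# Sketch: profile a latency-style generation run with torch.profiler.
# The engine arguments here (model, use_v2_block_manager) are assumptions,
# not a prescribed configuration.
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", use_v2_block_manager=True)
sampling_params = SamplingParams(max_tokens=128, ignore_eos=True)
prompts = ["Hello, my name is"] * 8  # small batch for the low-latency case

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./vllm_profile"),
) as prof:
    llm.generate(prompts, sampling_params)

# Block manager work happens on the CPU side of the scheduler,
# so sorting by self CPU time is what surfaces its overhead.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```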
@robertgshaw2-neuralmagic can you assign Alex?
Proposal to improve performance
We've recently rewritten the block management subsystem for better testability. We need to profile it under real load to confirm it is performant enough to replace block manager v1, and fix any issues we find.
We should do this once block manager v2 is feature complete (it is still missing a few items).
Known issue: `num_total_tokens` is O(N^2) instead of O(N) (see https://github.com/vllm-project/vllm/pull/4142#discussion_r1585245813).
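To illustrate the kind of fix this calls for, here is a toy sketch (not vLLM's actual block table classes) of how recomputing a token count on every query turns a per-token loop into O(N^2), and how a cached running count keeps the overall cost O(N):

```python
# Toy illustration only; class names and structure are made up for clarity.

class SlowBlockTable:
    def __init__(self) -> None:
        self._blocks: list[list[int]] = []

    def append_token(self, token_id: int, block_size: int = 16) -> None:
        if not self._blocks or len(self._blocks[-1]) == block_size:
            self._blocks.append([])
        self._blocks[-1].append(token_id)

    @property
    def num_total_tokens(self) -> int:
        # Walks every block on each call: O(number of blocks) per query,
        # hence O(N^2) total when queried once per appended token.
        return sum(len(block) for block in self._blocks)


class FastBlockTable(SlowBlockTable):
    def __init__(self) -> None:
        super().__init__()
        self._num_tokens = 0

    def append_token(self, token_id: int, block_size: int = 16) -> None:
        super().append_token(token_id, block_size)
        self._num_tokens += 1

    @property
    def num_total_tokens(self) -> int:
        # Cached running count: O(1) per query, O(N) over the sequence.
        return self._num_tokens
```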