Hello. Recently, I have been conducting experiments based on several hypotheses, but the results have been different from what I expected, so I am seeking your advice.
Hypothesis
I hypothesized that a larger block size would decrease throughput but improve latency. Conversely, I expected a smaller block size to increase throughput but worsen latency. The reasoning behind this is as follows:
For Larger Block Size:
Throughput: Decreases
Latency: Improves
Reason: With a larger block size, memory fragmentation increases, leading to inefficient use of memory space. However, the number of blocks to manage decreases, thereby reducing the overhead from block table management.
For Smaller Block Size:
Throughput: Increases
Latency: Worsens
Reason: With a smaller block size, memory fragmentation decreases, making memory use more efficient. However, the number of blocks to manage increases, thereby increasing the overhead from block table management.
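To make the fragmentation side of this hypothesis concrete, here is a minimal sketch (my own illustration, not vLLM's actual code) of where internal fragmentation comes from in a paged KV cache: only the last block of each sequence can be partially filled, so the waste per sequence is bounded by the block size.

```python
def wasted_tokens(seq_len: int, block_size: int) -> int:
    """KV-cache slots allocated but left unused for one sequence.

    Only the final, partially filled block wastes space; all other
    blocks are completely full.
    """
    remainder = seq_len % block_size
    return 0 if remainder == 0 else block_size - remainder

# Example: a 1000-token sequence
print(wasted_tokens(1000, 16))   # -> 8   (63 blocks, last holds 8/16)
print(wasted_tokens(1000, 128))  # -> 24  (8 blocks, last holds 104/128)
```

So per sequence the fragmentation penalty of a larger block size is at most block_size - 1 extra slots, which is the quantity the hypothesis expects to hurt throughput.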
Experimental Setup
Model: Llama-2-7b-hf, utilizing 12.5523 GB of memory.
GPU Memory Utilization: Set to gpu_memory_utilization=0.9 (default value).
Other Options: enable_chunked_prefill=True.
Case 1: Block size 16 (3780 GPU blocks)
Case 2: Block size 128 (472 GPU blocks)
Experimental Environment: Both cases use the same number of input and output tokens (min_tokens and max_tokens are set to identical values).
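One back-of-the-envelope check on this setup (my own arithmetic, not a vLLM API call): multiplying the reported GPU block counts by the block size shows that both cases end up with almost the same total KV-cache capacity in tokens, so neither configuration starts with a meaningful capacity advantage.

```python
# Total KV-cache capacity in tokens for each case, using the GPU block
# counts reported in the setup above.
cases = {16: 3780, 128: 472}  # block_size -> num_gpu_blocks

for block_size, num_blocks in cases.items():
    print(f"block_size={block_size}: {num_blocks * block_size} tokens")
# block_size=16: 60480 tokens
# block_size=128: 60416 tokens
```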
Experiment
Fig 1:
A continuous, fixed stream of requests is sent to vLLM, with 256 requests processed simultaneously.
Each time one request finishes, another request is sent, keeping 256 requests in flight.
Requests of various lengths are processed simultaneously (e.g., 4 requests of 890-1000 tokens, 1 request of 600-890 tokens, 16 requests of 100-226 tokens, and 256 requests of 80-100 tokens).
Fig 2:
12 requests sent per second.
Total number of requests is 1299, with a Coefficient of Variation (CV) of 2.
The number of output tokens ranges from 125 to 128.
Results
The experiment results showed almost no performance difference between the two cases. Additionally, with a block size of 16, the average memory usage was about 50%, while with a block size of 128, it was about 70%. The difference in memory usage due to fragmentation was about 10-20%, but no requests went pending, since GPU memory usage never exceeded 99.9%.
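The observed 10-20% gap is roughly what a simple fragmentation model predicts. On average the last block of a sequence is half full, so expected waste per sequence is about block_size / 2 tokens. A quick sketch of the resulting overhead fraction (my own estimate; the 500-token average sequence length is an assumed illustrative value, not a measured one):

```python
def expected_frag_fraction(avg_seq_len: float, block_size: int) -> float:
    """Fraction of allocated KV-cache slots expected to be wasted,
    assuming the last block of each sequence is half full on average."""
    avg_waste = block_size / 2
    return avg_waste / (avg_seq_len + avg_waste)

for bs in (16, 128):
    print(bs, round(expected_frag_fraction(500, bs), 4))
# 16  -> 0.0157
# 128 -> 0.1135
```

Under this model the block-size-128 case wastes roughly 10% more of its allocation than the block-size-16 case, which is consistent with the measured 50% vs. 70% usage; but since the cache never filled, that waste never translated into pending requests or a throughput difference.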
Main Question
Why are the experimental results not reflecting the assumption that "Smaller block sizes reduce memory fragmentation but increase management overhead, and larger block sizes increase memory fragmentation but reduce management overhead"?