vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Misc]: Why doesn't a larger block size result in faster performance? #6868

Open KimMinSang96 opened 4 months ago

KimMinSang96 commented 4 months ago

Anything you want to discuss about vllm.

Hello. Recently, I have been conducting experiments based on several hypotheses, but the results have been different from what I expected, so I am seeking your advice.

Hypothesis

I hypothesized that a larger block size would decrease throughput but improve latency. Conversely, I expected a smaller block size to increase throughput but worsen latency. The reasoning behind this is as follows:

  1. For Larger Block Size:
    • Throughput: Decreases
    • Latency: Improves
    • Reason: With a larger block size, memory fragmentation increases, leading to inefficient use of memory space. However, the number of blocks to manage decreases, thereby reducing the overhead from block table management.
  2. For Smaller Block Size:
    • Throughput: Increases
    • Latency: Worsens
    • Reason: With a smaller block size, memory fragmentation decreases, making memory use more efficient. However, the number of blocks to manage increases, thereby increasing the overhead from block table management.
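To make the trade-off in the hypothesis concrete, here is a minimal sketch of the arithmetic behind it. It assumes a sequence's KV cache is stored in fixed-size blocks and that only the last block can be partly empty. The helper name `blocks_and_waste` and the sequence length of 1000 tokens are hypothetical, chosen for illustration only; this is not vLLM's allocator.

```python
import math


def blocks_and_waste(seq_len: int, block_size: int) -> tuple[int, int]:
    """Return (blocks needed, unused token slots in the last block).

    Hypothetical helper: models only last-block (internal) fragmentation.
    """
    num_blocks = math.ceil(seq_len / block_size)
    wasted_slots = num_blocks * block_size - seq_len
    return num_blocks, wasted_slots


# Illustrative sequence length, not taken from the experiment.
for block_size in (16, 128):
    n, w = blocks_and_waste(1000, block_size)
    print(f"block_size={block_size}: {n} blocks, {w} unused slots")
```

For a 1000-token sequence this gives 63 blocks with 8 unused slots at block size 16, and 8 blocks with 24 unused slots at block size 128: fewer block-table entries, but more unused space per sequence.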

Experimental Setup

[Figure: experiment setup; image not recovered]

[Fig 1 and Fig 2: images not recovered]

Experimental Results

[Figure: results; image not recovered]

The experimental results showed almost no performance difference between the two block sizes. Additionally, with a block size of 16, the average memory usage was about 50%, while with a block size of 128, it was about 70%. The difference in memory usage due to fragmentation was about 10-20%, but there were no pending requests, since the GPU memory space was used above 99.9%.
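As a rough cross-check on fragmentation figures like these, the sketch below estimates what fraction of allocated token slots would sit unused across a batch of sequences. It assumes the only waste is the partly filled last block of each sequence. The function name `waste_fraction` and the sequence lengths are hypothetical and are not the workload used in the experiment above.

```python
import math


def waste_fraction(seq_lens, block_size):
    """Fraction of allocated token slots left unused across all sequences.

    Hypothetical model: each sequence wastes only the tail of its last block.
    """
    allocated = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    used = sum(seq_lens)
    return (allocated - used) / allocated


# Illustrative sequence lengths, not the lengths from the experiment.
lengths = [100, 300, 700]
for block_size in (16, 128):
    print(f"block_size={block_size}: {waste_fraction(lengths, block_size):.1%} unused")
```

Under this simple model the unused share grows with block size and shrinks as sequences get longer, so the size of the gap depends heavily on the length distribution of the requests.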

Main Question

Why are the experimental results not reflecting the assumption that "Smaller block sizes reduce memory fragmentation but increase management overhead, and larger block sizes increase memory fragmentation but reduce management overhead"?

tempcollab commented 1 month ago

Following up on this. Block size doesn't seem to have any effect.