Hello. Recently, I have been conducting experiments based on several hypotheses, but the results have been different from what I expected, so I am seeking your advice.
Hypothesis
I hypothesized that a larger block size would decrease throughput but improve latency. Conversely, I expected a smaller block size to increase throughput but worsen latency. The reasoning behind this is as follows:
For Larger Block Size:
Throughput: Decreases
Latency: Improves
Reason: With a larger block size, memory fragmentation increases, leading to inefficient use of memory space. However, the number of blocks to manage decreases, thereby reducing the overhead from block table management.
For Smaller Block Size:
Throughput: Increases
Latency: Worsens
Reason: With a smaller block size, memory fragmentation decreases, making memory use more efficient. However, the number of blocks to manage increases, thereby increasing the overhead from block table management.
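To make the fragmentation side of this hypothesis concrete, here is a minimal sketch (my own illustration, not vLLM's actual code) of where internal fragmentation comes from in a paged KV cache: only the last block of each sequence can be partially filled, so the waste per sequence is bounded by the block size.

```python
def wasted_tokens(seq_len: int, block_size: int) -> int:
    """KV-cache slots allocated but left unused for one sequence.

    Only the final, partially filled block wastes space; all other
    blocks are completely full.
    """
    remainder = seq_len % block_size
    return 0 if remainder == 0 else block_size - remainder

# Example: a 1000-token sequence
print(wasted_tokens(1000, 16))   # -> 8   (63 blocks, last holds 8/16)
print(wasted_tokens(1000, 128))  # -> 24  (8 blocks, last holds 104/128)
```

So per sequence the fragmentation penalty of a larger block size is at most block_size - 1 extra slots, which is the quantity the hypothesis expects to hurt throughput.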
Experimental Setup
Model: Llama-2-7b-hf, utilizing 12.5523 GB of memory.
GPU Memory Utilization: Set to gpu_memory_utilization=0.9 (default value).
Other Options: enable_chunked_prefill=True.
Case 1: Block size 16 (3780 GPU blocks)
Case 2: Block size 128 (472 GPU blocks)
Experimental Environment: Both cases use the same number of input and output tokens (min_tokens and max_tokens are set to identical values).
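One back-of-the-envelope check on this setup (my own arithmetic, not a vLLM API call): multiplying the reported GPU block counts by the block size shows that both cases end up with almost the same total KV-cache capacity in tokens, so neither configuration starts with a meaningful capacity advantage.

```python
# Total KV-cache capacity in tokens for each case, using the GPU block
# counts reported in the setup above.
cases = {16: 3780, 128: 472}  # block_size -> num_gpu_blocks

for block_size, num_blocks in cases.items():
    print(f"block_size={block_size}: {num_blocks * block_size} tokens")
# block_size=16: 60480 tokens
# block_size=128: 60416 tokens
```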
Experiment
Fig 1:
A continuous, fixed stream of requests is sent to vLLM, with 256 requests processed simultaneously.
Each time one request finishes, another request is sent, keeping 256 requests in flight.
Requests of various lengths are processed simultaneously (e.g., 4 requests of 890-1000 tokens, 1 request of 600-890 tokens, 16 requests of 100-226 tokens, and 256 requests of 80-100 tokens).
Fig 2:
12 requests sent per second.
Total number of requests is 1299, with a Coefficient of Variation (CV) of 2.
The number of output tokens ranges from 125 to 128.
Results
The experiment results showed almost no performance difference between the two cases. Additionally, with a block size of 16, the average memory usage was about 50%, while with a block size of 128, it was about 70%. The difference in memory usage due to fragmentation was about 10-20%, but no requests went pending, since GPU memory usage never exceeded 99.9%.
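The observed 10-20% gap is roughly what a simple fragmentation model predicts. On average the last block of a sequence is half full, so expected waste per sequence is about block_size / 2 tokens. A quick sketch of the resulting overhead fraction (my own estimate; the 500-token average sequence length is an assumed illustrative value, not a measured one):

```python
def expected_frag_fraction(avg_seq_len: float, block_size: int) -> float:
    """Fraction of allocated KV-cache slots expected to be wasted,
    assuming the last block of each sequence is half full on average."""
    avg_waste = block_size / 2
    return avg_waste / (avg_seq_len + avg_waste)

for bs in (16, 128):
    print(bs, round(expected_frag_fraction(500, bs), 4))
# 16  -> 0.0157
# 128 -> 0.1135
```

Under this model the block-size-128 case wastes roughly 10% more of its allocation than the block-size-16 case, which is consistent with the measured 50% vs. 70% usage; but since the cache never filled, that waste never translated into pending requests or a throughput difference.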
Main Question
Why are the experimental results not reflecting the assumption that "Smaller block sizes reduce memory fragmentation but increase management overhead, and larger block sizes increase memory fragmentation but reduce management overhead"?