charai-frontend opened 1 month ago
cc @andoorve
This is kind of expected behavior based on what our implementation of PP aims to do. We report 1650 blocks because that is the total number of blocks available on your GPU. However, this total gets divided into 2 KV-cache sections so that multiple request streams can be served at the same time; this is what lets us pipeline the request streams concurrently, with load balancing between the two. The reporting could possibly be improved to reflect this.
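To make the arithmetic concrete, here is a minimal sketch (not vLLM's actual code) of how the capacity reported at startup ends up split per request stream. The block size of 16 tokens and the even split are assumptions based on the numbers in this issue:

```python
# Minimal sketch (NOT vLLM's actual code): how the startup-reported block count
# ends up divided per request stream (virtual engine) under pipeline parallelism.

BLOCK_SIZE = 16  # tokens per KV-cache block; vLLM's default, assumed here


def per_stream_blocks(total_gpu_blocks: int, pipeline_parallel_size: int) -> int:
    """Each request stream gets an equal share of the GPU blocks."""
    return total_gpu_blocks // pipeline_parallel_size


total = 1650                                # blocks reported at startup in this issue
per_stream = per_stream_blocks(total, 2)    # -> 825 blocks
print(per_stream, per_stream * BLOCK_SIZE)  # -> 825 blocks, 13200 tokens (~13.2k)
```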
The situation you are describing (a very long prompt) is possible with some changes. You can submit a feature request for it if there's a very clear use case. However, it hasn't been a focus until now, since essentially only one of those very long prompts could be resident in the cache at a time. That would mean essentially no pipelining, so you might want to see whether tensor parallelism serves your use case better.
Your current environment
🐛 Describe the bug
Using `--pipeline_parallel_size=2`, vLLM throws an error if the prompt uses more than half of the available tokens. At startup it reports a capacity of 1650 blocks / 26.4k tokens when loading the model; `--max_model_len` was set to 24000.

But when sending any prompt with >13k tokens, it throws an `input prompt is too long` error.

Adding print statements shows that the check is done against 825 blocks (13.2k tokens), which is half of the capacity reported at startup, and a request with a 21.8k-token prompt fails.
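For reference, here is a hypothetical reproduction of the failing check (a sketch, not vLLM's actual code). The function name, block size, and printed message are illustrative assumptions; only the numbers (825 blocks per stream, 1650 total, the 21.8k-token prompt) come from this report:

```python
# Hypothetical sketch of the capacity check that rejects the prompt.
# The prompt's block requirement is compared against the per-stream block count,
# not against the full 1650 blocks reported at startup.

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM default, assumed)


def can_allocate(prompt_tokens: int, available_blocks: int) -> bool:
    required_blocks = -(-prompt_tokens // BLOCK_SIZE)  # ceiling division
    if required_blocks > available_blocks:
        print(f"prompt needs {required_blocks} blocks, "
              f"only {available_blocks} available -> input prompt is too long")
        return False
    return True


can_allocate(21_800, 825)    # fails: 1363 blocks needed, per-stream limit is 825
can_allocate(21_800, 1650)   # would pass against the full reported capacity
```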