timothylimyl opened 10 months ago
Old vLLM does support packing in continuous batching, but after v0.2.2, padding is needed to process prompts of different lengths; refer to https://github.com/vllm-project/vllm/issues/1985.
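For anyone unfamiliar with the distinction this comment draws, here is a toy sketch of the two layouts; the prompt values, `pad_id`, and the offset bookkeeping are purely illustrative, not vLLM internals:

```python
# Illustrative only: how three prompts of different lengths occupy a batch.
prompts = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
pad_id = 0

# Padded layout: one row per prompt, right-padded to the longest,
# wasting (max_len - len(p)) slots per prompt.
max_len = max(len(p) for p in prompts)
padded = [p + [pad_id] * (max_len - len(p)) for p in prompts]

# Packed layout: a single contiguous token stream plus per-prompt offsets,
# so no slot is spent on pad tokens.
packed = [tok for p in prompts for tok in p]
offsets, pos = [], 0
for p in prompts:
    offsets.append((pos, pos + len(p)))
    pos += len(p)

print(padded)   # [[1, 2, 3, 0, 0], [4, 5, 0, 0, 0], [6, 7, 8, 9, 10]]
print(packed)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(offsets)  # [(0, 3), (3, 5), (5, 10)]
```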
> Old vLLM does support packing in continuous batching, but after v0.2.2, padding is needed to process prompts of different lengths; refer to #1985.

So if bs=1, there is no need to consider this point. Additionally, are there any custom optimization directions for scheduling with respect to the Llama model?
> Old vLLM does support packing in continuous batching, but after v0.2.2, padding is needed to process prompts of different lengths; refer to #1985.

Do we have a benchmark comparison between versions?
I see that vLLM does continuous batching; I wonder whether we can incorporate packing into continuous batching.

The idea off the top of my head is that, using the user-defined maximum sequence length and maximum tokens, we can actually concat/pack the input tokens together (in the same continuous fashion).
Given:

- `amt_input_seq`
- `max_seq_len`
- `max_tokens`
- `total_input_tokens`, where `total_input_tokens` is the total number of tokens from packing a number of different input sequences (`amt_input_seq`) together

Condition to optimise for packing: (`amt_input_seq` * `max_tokens`) + `total_input_tokens` <= `max_seq_len` (a minimal check is sketched below).
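A minimal sketch of that feasibility check; `can_pack` is a hypothetical helper name, not part of vLLM's API. It reserves `max_tokens` of generation headroom per sequence, so inputs plus all possible outputs still fit in one context:

```python
# Hypothetical helper (not vLLM code): can these input sequences be packed
# into a single context of max_seq_len, if each may grow by up to max_tokens?
def can_pack(input_seq_lens: list[int], max_seq_len: int, max_tokens: int) -> bool:
    amt_input_seq = len(input_seq_lens)
    total_input_tokens = sum(input_seq_lens)
    # Condition from above: all input tokens, plus max_tokens of reserved
    # output room for every packed sequence, must fit within max_seq_len.
    return amt_input_seq * max_tokens + total_input_tokens <= max_seq_len

# Example: three prompts of 100/250/400 tokens, 128 output tokens each, 4k context.
print(can_pack([100, 250, 400], max_seq_len=4096, max_tokens=128))  # True (1134 <= 4096)
```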
Basically, instead of continuous batching and padding, we continuously batch and pack to fully utilise the GPU. What do you think?
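Reusing the hypothetical `can_pack` from above, a greedy admission loop shows how a scheduler might keep packing waiting sequences until the condition fails; again a sketch under those assumptions, not vLLM's actual scheduler:

```python
# Hypothetical greedy scheduler sketch: admit waiting sequences into the
# packed batch while the packing condition still holds.
def pack_batch(waiting_seq_lens: list[int], max_seq_len: int, max_tokens: int) -> list[int]:
    admitted: list[int] = []
    for seq_len in waiting_seq_lens:
        if can_pack(admitted + [seq_len], max_seq_len, max_tokens):
            admitted.append(seq_len)
        else:
            break  # leave the rest for the next scheduling step
    return admitted

# With a 2k context and 128 reserved output tokens per sequence, only the
# first three prompts fit; the remainder wait for the next batch.
print(pack_batch([400, 500, 300, 600, 200], max_seq_len=2048, max_tokens=128))
# [400, 500, 300]
```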