vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

What do you think about integrating packing inference? #2121

Open timothylimyl opened 10 months ago

timothylimyl commented 10 months ago

I see that vLLM does continuous batching, and I wonder whether we can incorporate packing into continuous batching as well.

The idea off the top of my head: using the user-defined maximum sequence length and maximum number of tokens, we can concatenate/pack the input tokens together (in the same continuous fashion).

Given the condition to optimise for packing:

(amt_input_seq * max_tokens) + total_input_tokens <= max_seq_len

Basically, instead of continuous batching and padding, we continuously batch and pack to fully utilise the GPU. What do you think? (A rough sketch of the idea is shown below.)
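A minimal sketch of the condition above, assuming a greedy grouping loop; the function `pack_requests` and its logic are illustrative only, not part of vLLM's actual scheduler:

```python
from typing import List


def pack_requests(prompt_lens: List[int], max_tokens: int, max_seq_len: int) -> List[List[int]]:
    """Greedily group prompt indices so that each group satisfies
    (num_seqs * max_tokens) + total_prompt_tokens <= max_seq_len,
    i.e. even after every sequence generates up to `max_tokens` new tokens,
    the packed block still fits in the context window."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_prompt_tokens = 0

    for idx, length in enumerate(prompt_lens):
        num_seqs = len(current) + 1
        if (num_seqs * max_tokens) + current_prompt_tokens + length <= max_seq_len:
            # The prompt still fits in the current packed block.
            current.append(idx)
            current_prompt_tokens += length
        else:
            # Close the current block and start a new one with this prompt.
            if current:
                groups.append(current)
            current = [idx]
            current_prompt_tokens = length
    if current:
        groups.append(current)
    return groups


# Example: prompts of lengths 100, 300, and 50 with max_tokens=128
# all fit into a single 1024-token packed block.
print(pack_requests([100, 300, 50], max_tokens=128, max_seq_len=1024))  # [[0, 1, 2]]
```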

irasin commented 10 months ago

Old vLLM did support packing in continuous batching, but after v0.2.2, padding is needed to process prompts of different lengths; refer to https://github.com/vllm-project/vllm/issues/1985.
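For context, a minimal illustration (not vLLM code) of the padded versus packed layouts being contrasted here, assuming three toy prompts of lengths 2, 4, and 3:

```python
prompts = [[11, 12], [21, 22, 23, 24], [31, 32, 33]]
pad_id = 0

# Padded: every prompt is right-padded to the longest length -> shape [3, 4].
max_len = max(len(p) for p in prompts)
padded = [p + [pad_id] * (max_len - len(p)) for p in prompts]

# Packed: prompts are concatenated into one flat sequence, with cumulative
# sequence lengths recording the boundaries -> no wasted pad slots.
packed = [tok for p in prompts for tok in p]
cu_seqlens = [0]
for p in prompts:
    cu_seqlens.append(cu_seqlens[-1] + len(p))

print(padded)      # [[11, 12, 0, 0], [21, 22, 23, 24], [31, 32, 33, 0]]
print(packed)      # [11, 12, 21, 22, 23, 24, 31, 32, 33]
print(cu_seqlens)  # [0, 2, 6, 9]
```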

Lvjinhong commented 10 months ago

Old vLLM did support packing in continuous batching, but after v0.2.2, padding is needed to process prompts of different lengths; refer to #1985.

So if bs=1, there is no need to consider this point. Additionally, are there any custom optimization directions for scheduling with respect to the Llama model?

KaiQiangSong commented 7 months ago

Old vLLM did support packing in continuous batching, but after v0.2.2, padding is needed to process prompts of different lengths; refer to #1985.

Do we have a benchmark comparison between versions?