vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

What do you think about integrating packing inference? #2121

Open timothylimyl opened 10 months ago

timothylimyl commented 10 months ago

I see that vLLM does continuous batching, and I wonder whether we can incorporate packing into continuous batching as well.

The idea off the top of my head: using the user-defined maximum sequence length and maximum number of tokens, we can concatenate/pack the input tokens together (in the same continuous fashion).

Given the condition to optimise for packing:

(amt_input_seq * max_tokens) + total_input_tokens <= max_seq_len

Basically, instead of continuous batching and padding, we continuously batch and pack to fully utilise the GPU. What do you think? (A rough sketch of the idea is shown below.)
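A minimal sketch of the condition above, assuming a greedy grouping loop; the function `pack_requests` and its logic are illustrative only, not part of vLLM's actual scheduler:

```python
from typing import List


def pack_requests(prompt_lens: List[int], max_tokens: int, max_seq_len: int) -> List[List[int]]:
    """Greedily group prompt indices so that each group satisfies
    (num_seqs * max_tokens) + total_prompt_tokens <= max_seq_len,
    i.e. even after every sequence generates up to `max_tokens` new tokens,
    the packed block still fits in the context window."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_prompt_tokens = 0

    for idx, length in enumerate(prompt_lens):
        num_seqs = len(current) + 1
        if (num_seqs * max_tokens) + current_prompt_tokens + length <= max_seq_len:
            # The prompt still fits in the current packed block.
            current.append(idx)
            current_prompt_tokens += length
        else:
            # Close the current block and start a new one with this prompt.
            if current:
                groups.append(current)
            current = [idx]
            current_prompt_tokens = length
    if current:
        groups.append(current)
    return groups


# Example: prompts of lengths 100, 300, and 50 with max_tokens=128
# all fit into a single 1024-token packed block.
print(pack_requests([100, 300, 50], max_tokens=128, max_seq_len=1024))  # [[0, 1, 2]]
```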

irasin commented 10 months ago

Old vLLM did support packing in continuous batching, but after v0.2.2, padding is needed to process prompts of different lengths; refer to https://github.com/vllm-project/vllm/issues/1985.
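For context, a minimal illustration (not vLLM code) of the padded versus packed layouts being contrasted here, assuming three toy prompts of lengths 2, 4, and 3:

```python
prompts = [[11, 12], [21, 22, 23, 24], [31, 32, 33]]
pad_id = 0

# Padded: every prompt is right-padded to the longest length -> shape [3, 4].
max_len = max(len(p) for p in prompts)
padded = [p + [pad_id] * (max_len - len(p)) for p in prompts]

# Packed: prompts are concatenated into one flat sequence, with cumulative
# sequence lengths recording the boundaries -> no wasted pad slots.
packed = [tok for p in prompts for tok in p]
cu_seqlens = [0]
for p in prompts:
    cu_seqlens.append(cu_seqlens[-1] + len(p))

print(padded)      # [[11, 12, 0, 0], [21, 22, 23, 24], [31, 32, 33, 0]]
print(packed)      # [11, 12, 21, 22, 23, 24, 31, 32, 33]
print(cu_seqlens)  # [0, 2, 6, 9]
```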

Lvjinhong commented 10 months ago

Old vLLM did support packing in continuous batching, but after v0.2.2, padding is needed to process prompts of different lengths; refer to #1985.

So if bs=1, there is no need to consider this point. Additionally, are there any custom optimization directions for scheduling with respect to the Llama model?

KaiQiangSong commented 7 months ago

Old vLLM did support packing in continuous batching, but after v0.2.2, padding is needed to process prompts of different lengths; refer to #1985.

Do we have a benchmark comparison between versions?