
[Performance]: From SequenceGroup-native code to Sequence-native code #7116

Open youkaichao opened 3 months ago

youkaichao commented 3 months ago

Proposal to improve performance

We have two concepts in vLLM:

- `Sequence`: a single stream of tokens being generated.
- `SequenceGroup`: a group of `Sequence`s that come from the same request, e.g. for parallel sampling (`n > 1`) or beam search.

In order to support diverse sampling algorithms, vLLM currently takes a SequenceGroup-native approach: many functions operate at the SequenceGroup level, e.g. prepare_input takes in a list of SequenceGroups.

The problem is that many functions in an inference engine naturally fit Sequence-level operations. For example, when we talk about the batch size for decoding, it is the number of Sequences we are running decoding for, not the number of SequenceGroups (a single request with parallel sampling n=4 is one SequenceGroup but contributes four Sequences to the decode batch).

To bridge this gap, there is a lot of code in vLLM that receives SequenceGroups and unpacks them into Sequences for further operations. Notably, prepare input:

https://github.com/vllm-project/vllm/blob/825b044863a8e3af82a82a80cd2617486cc829ca/vllm/worker/model_runner.py#L507-L510
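For a rough illustration, here is a minimal, self-contained sketch of that pattern (the `Sequence`/`SequenceGroup` stand-ins and the `prepare_decode_input` helper below are simplified for the example, not the actual vLLM classes):

```python
# Simplified stand-ins for illustration only, not the real vLLM classes.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sequence:
    seq_id: int
    token_ids: List[int]


@dataclass
class SequenceGroup:
    request_id: str
    seqs: List[Sequence] = field(default_factory=list)

    def get_seqs(self) -> List[Sequence]:
        return self.seqs


def prepare_decode_input(seq_groups: List[SequenceGroup]) -> List[List[int]]:
    """SequenceGroup-native: every call has to unpack groups into sequences."""
    input_tokens: List[List[int]] = []
    for seq_group in seq_groups:          # iterate over groups ...
        for seq in seq_group.get_seqs():  # ... then unpack into sequences
            input_tokens.append(seq.token_ids[-1:])  # last token for decode
    return input_tokens
```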

This turns out to be very inefficient and makes the code difficult to read and maintain.

To get a rough impression of how inefficient these conversions can be, take a look at https://github.com/vllm-project/vllm/pull/7051, where simply removing some get_seqs calls on SequenceGroup leads to a 1% end-to-end throughput gain.

Per the discussion in https://github.com/vllm-project/vllm/issues/6226, we will not directly drop beam search support. However, we should figure out a way to support it without hurting the performance of the majority use case.

The proposal I want to discuss is to move the vLLM code to a Sequence-native approach, inspired by the lightllm approach:

All functions that operate on the Sequence level (mainly the model runner part) will natively receive a list of Sequences. They no longer need to unpack SequenceGroups.

Functions that operate on the SequenceGroup level (mainly the scheduler logic for gang-scheduling a sequence group, and the output processor logic that creates/removes sequences in the group) have to reconstruct the sequence group from the given list of sequences, leveraging a global sequence-to-group mapping. Note that an important optimization is that we can skip all of the sequence group logic when the global mapping is empty, meaning there is no parallel sampling or beam search in flight. A sketch of this flow follows below.
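Here is a rough sketch of that flow, reusing the `Sequence` stand-in from the sketch above (the `seq_id_to_group` mapping and the helper names are hypothetical illustrations, not existing vLLM APIs):

```python
from typing import Dict, List

# Hypothetical global mapping from seq_id to request_id; it is only populated
# for requests that use parallel sampling or beam search.
seq_id_to_group: Dict[int, str] = {}


def prepare_decode_input_native(seqs: List[Sequence]) -> List[List[int]]:
    """Sequence-native: the model runner works on sequences directly."""
    return [seq.token_ids[-1:] for seq in seqs]


def group_level_postprocess(seqs: List[Sequence]) -> None:
    """Group-level logic reconstructs SequenceGroups only when needed."""
    if not seq_id_to_group:
        # Fast path: no parallel sampling or beam search in flight, so all
        # SequenceGroup bookkeeping is skipped entirely.
        return
    groups: Dict[str, List[Sequence]] = {}
    for seq in seqs:
        request_id = seq_id_to_group.get(seq.seq_id)
        if request_id is not None:
            groups.setdefault(request_id, []).append(seq)
    # ... per-group bookkeeping (forking/pruning sequences, etc.) goes here ...
```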

When we do have parallel sampling or beam search, this will incur some performance drop. However, with the greatly simplified code in the model runner, we can expect the other parts of vLLM to be greatly accelerated, so beam search and parallel sampling can also end up faster at the end of the day.

An example benefit is that this function can be greatly simplified (we can return early):

https://github.com/vllm-project/vllm/blob/825b044863a8e3af82a82a80cd2617486cc829ca/vllm/engine/output_processor/single_step.py#L82
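For example, a single-step output processor along these lines could take the fast path whenever the global mapping is empty (again a hypothetical simplification reusing the stand-ins above, not the actual single_step.py code):

```python
def process_decode_output(seq: Sequence, new_token_id: int) -> None:
    """Sketch of a simplified single-step output processor."""
    if not seq_id_to_group:
        # Common case: one sequence per request. Just append the new token
        # and return early, without touching any SequenceGroup state.
        seq.token_ids.append(new_token_id)
        return
    # Otherwise fall back to SequenceGroup-level handling, which may fork or
    # prune sequences for beam search / parallel sampling.
    ...
```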

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`
github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!