Open · youkaichao opened 3 months ago
Proposal to improve performance
We have two concepts in vLLM:

- Sequence: a single sequence of tokens, which grows as decoding proceeds.
- SequenceGroup: the group of sequences generated from the same request, e.g. the n output sequences of parallel sampling. In beam search, sequences in the sequence group can change, grow, and die.

In order to support diverse sampling algorithms, vLLM currently takes a SequenceGroup-native approach: many functions operate at the SequenceGroup level, e.g. prepare_input takes in a list of SequenceGroup.

The problem is that many functions in an inference engine naturally fit Sequence-level operations. For example, when we talk about the batch size for decoding, it is the number of Sequences we are running decoding for, not the number of SequenceGroups.
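For intuition, here is a highly simplified sketch of the two concepts; the dataclasses below are illustrative only, not the actual classes in vllm/sequence.py:

```python
# Highly simplified illustration of the two concepts (not the actual classes
# in vllm/sequence.py, which carry much more state).
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sequence:
    seq_id: int
    token_ids: List[int] = field(default_factory=list)  # grows during decoding


@dataclass
class SequenceGroup:
    request_id: str
    seqs: List[Sequence] = field(default_factory=list)  # n > 1 for parallel sampling / beam search


# The decode batch size is the number of Sequences, not the number of SequenceGroups:
groups = [
    SequenceGroup("req-0", [Sequence(0), Sequence(1)]),  # e.g. parallel sampling with n=2
    SequenceGroup("req-1", [Sequence(2)]),
]
decode_batch_size = sum(len(g.seqs) for g in groups)  # 3, although there are only 2 groups
```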
To fill the gap, there is a lot of code in vLLM that receives a SequenceGroup and unpacks it into Sequences for further operations. Notably, prepare input:
https://github.com/vllm-project/vllm/blob/825b044863a8e3af82a82a80cd2617486cc829ca/vllm/worker/model_runner.py#L507-L510
This turns out to be very inefficient, and it makes the code difficult to read and maintain.
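The conversion pattern looks roughly like this; all names below are hypothetical stand-ins, and the real loop behind the permalink above handles far more metadata:

```python
# Hedged sketch of the SequenceGroup -> Sequence unpacking that per-sequence
# work currently requires (illustrative types, not the real model_runner code).
from dataclasses import dataclass
from typing import List


@dataclass
class Seq:
    token_ids: List[int]


@dataclass
class SeqGroup:
    seqs: List[Seq]


def prepare_decode_tokens(seq_groups: List[SeqGroup]) -> List[int]:
    """Collect the last token of every sequence for the next decode step."""
    input_tokens: List[int] = []
    for group in seq_groups:      # the input arrives at the group level ...
        for seq in group.seqs:    # ... and has to be unpacked into sequences
            input_tokens.append(seq.token_ids[-1])
    return input_tokens


# Two groups (the first uses parallel sampling) flatten into a decode batch of 3 tokens.
assert prepare_decode_tokens([
    SeqGroup([Seq([7, 8]), Seq([7, 9])]),
    SeqGroup([Seq([5, 6])]),
]) == [8, 9, 6]
```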
To get a rough impression of how inefficient these conversions can be, take a look at https://github.com/vllm-project/vllm/pull/7051 , where simply removing some get_seqs calls in SequenceGroup can lead to a 1% end-to-end throughput gain.

Per the discussion in https://github.com/vllm-project/vllm/issues/6226 , we will not directly drop beam search support. However, we should figure out a way to support it without hurting the performance of the majority use case.
The proposal I want to discuss is to move the vLLM code to a Sequence-native approach. It is inspired by the lightllm approach:

- Keep a global mapping, Dict[int, List[int]], from the sequence group id to the ids of the sequences inside the group, only for sequence groups with parallel sampling or beam search.
- All functions that operate on the Sequence level (mainly the model runner part) natively receive a list of Sequences. They don't need to unpack SequenceGroup any more.
- Functions that operate on the SequenceGroup level (mainly the scheduler logic that gang-schedules a sequence group, and the output processor logic that creates/removes sequences in the group) reconstruct the sequence group from the given list of sequences, leveraging the global mapping. Note that an important optimization is to skip all the sequence group logic when the global mapping is empty, meaning we don't have any parallel sampling or beam search; see the sketch below.
When we do have parallel sampling or beam search, this will incur some performance drop. However, with the greatly simplified code in the model runner, we can expect the rest of vLLM to be greatly accelerated, so beam search and parallel sampling can also end up faster at the end of the day.
An example benefit is that this function can be greatly simplified (we can return early):
https://github.com/vllm-project/vllm/blob/825b044863a8e3af82a82a80cd2617486cc829ca/vllm/engine/output_processor/single_step.py#L82
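A hedged sketch of what that early return could look like, assuming the group exposes its sampling parameters (this is not the actual single_step.py implementation):

```python
# Hedged sketch of the early-return opportunity, not the actual
# single_step.py code.
def process_sequence_group_outputs(seq_group, outputs) -> None:
    params = seq_group.sampling_params
    if params.n == 1 and not params.use_beam_search:
        # Common case: the group holds exactly one sequence, so append the
        # sampled token and skip all fork/prune/free bookkeeping.
        seq = seq_group.get_seqs()[0]
        sample = outputs.samples[0]
        seq.append_token_id(sample.output_token, sample.logprobs)
        return
    # Parallel sampling / beam search requests fall through to the existing
    # group-level logic (forking parents, pruning beams, freeing sequences).
    ...
```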
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)