Hi @flyman3046, thanks for trying out vLLM! Could you try
llm.generate(my_dataaset, sampling_params, ignore_eos=True)
instead of the for loop? The LLM class internally maintains a queue of input sequences and automatically refills the batch whenever a sequence finishes. This is one of the factors that make vLLM significantly faster than HF. Please try this out!
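For concreteness, a minimal sketch of the two calling patterns (the model name and prompts are placeholders, not from this thread, and ignore_eos is set on SamplingParams here):

```python
from vllm import LLM, SamplingParams

my_dataset = ["Prompt 1", "Prompt 2", "Prompt 3"]  # placeholder prompts
llm = LLM(model="facebook/opt-6.7b")               # placeholder model name
sampling_params = SamplingParams(max_tokens=256, ignore_eos=True)

# Slow pattern: one request per call, so each generate() runs with batch size 1.
# for prompt in my_dataset:
#     llm.generate([prompt], sampling_params)

# Fast pattern: hand the whole dataset to vLLM and let it batch internally.
outputs = llm.generate(my_dataset, sampling_params)
```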
I gave it another try with llm.generate(my_dataaset, ...) and it indeed speeds things up by quite a lot, from 180 seconds to 16 seconds.
A follow-up question: how much of the speedup is due to batching, and how much to other improvements? Is it a fair comparison if HF does not use batching whereas vLLM does? Thanks again!
@flyman3046 Thanks for sharing your experience! We use a more sophisticated batching mechanism than traditional static batching. In short, vLLM does not wait until all the sequences in a batch finish; it packs incoming sequences into the batch whenever any sequence in it finishes. In our experience, this leads to a 3x-10x throughput improvement. Implementing this on top of HF would require rewriting the model code and developing special CUDA kernels, which is what vLLM does.
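To make the batching difference concrete, here is a toy simulation (not vLLM's actual scheduler; the request lengths are made up): a finished sequence's slot is refilled from the waiting queue immediately, instead of waiting for the whole batch to drain.

```python
from collections import deque

def continuous_batching_steps(remaining_tokens, max_batch_size):
    """Count decoding steps when free batch slots are refilled immediately."""
    waiting = deque(enumerate(remaining_tokens))   # (request_id, tokens_left)
    running, steps = [], 0
    while waiting or running:
        # Refill free slots as soon as they open up.
        while waiting and len(running) < max_batch_size:
            running.append(list(waiting.popleft()))
        # One decoding step: every running sequence emits one token.
        steps += 1
        for seq in running:
            seq[1] -= 1
        running = [seq for seq in running if seq[1] > 0]
    return steps

# Static batching would run [8, 8, 8, 100] for 100 steps, then [8, 8, 8, 8] for
# another 8 (108 total); continuous batching reuses the freed slots and
# finishes everything in 100 steps.
print(continuous_batching_steps([8, 8, 8, 100, 8, 8, 8, 8], max_batch_size=4))
```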
@WoosukKwon Another question, please help me: if I call print(llm.generate(data, sampling_params, ignore_eos=True)) inside every loop iteration, I get the answer immediately, so I think this means the generation is not batched across iterations. Is there something wrong with this?
Thanks for the great project.
I gave it a try and compared vLLM with HF's offline inference speed on 100 Alpaca examples. The hardware I used is a single V100-40G GPU. Here is my script for vLLM:
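Roughly along these lines (a minimal sketch rather than the exact code; the model name and prompt list are placeholders):

```python
import time
from vllm import LLM, SamplingParams

prompts = ["Give three tips for staying healthy."] * 100  # stand-in for the 100 Alpaca examples
llm = LLM(model="facebook/opt-6.7b")                      # placeholder model name
sampling_params = SamplingParams(max_tokens=256, ignore_eos=True)

start = time.time()
# One request per generate() call, as in the original comparison.
for prompt in prompts:
    llm.generate([prompt], sampling_params)
print(f"vLLM (per-prompt loop): {time.time() - start:.1f} s")
```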
and for HF:
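A comparable unbatched sketch for HF (again placeholders, not the exact script):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

prompts = ["Give three tips for staying healthy."] * 100  # stand-in for the 100 Alpaca examples
model_name = "facebook/opt-6.7b"                          # placeholder model name

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()

start = time.time()
# Unbatched: one prompt per generate() call.
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(f"HF (per-prompt loop): {time.time() - start:.1f} s")
```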
It seems that the speedup is ~3-4x (not 25x). Am I missing a special setup for vLLM? Thanks.