vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

vLLM is 4x faster than HF for offline inference #216

Closed. flyman3046 closed this issue 1 year ago.

flyman3046 commented 1 year ago

Thanks for the great project.

I gave it a try and compared it against HF's offline inference speed on 100 Alpaca examples. The hardware I used is a single V100-40G GPU. Here is my script for vLLM:

import time

from vllm import LLM, SamplingParams

# ignore_eos=True so every prompt generates the full max_tokens.
sampling_params = SamplingParams(temperature=0.1, top_p=0.75, top_k=40, max_tokens=128, ignore_eos=True)
llm = LLM(model="openlm-research/open_llama_13b")
# Prepare dataset.
start_time = time.time()
for data in my_dataset:
    # One prompt per call; ignore_eos is already set in sampling_params above.
    llm.generate(data, sampling_params)
end_time = time.time()

and for hf:

import time

from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("openlm-research/open_llama_7b").cuda()
tokenizer = LlamaTokenizer.from_pretrained("openlm-research/open_llama_7b")
# Sampling settings mirrored from the vLLM script above (the original elided these values).
generation_config = GenerationConfig(do_sample=True, temperature=0.1, top_p=0.75, top_k=40)
# Prepare dataset.
start_time = time.time()
for data in my_dataset:
    input_ids = tokenizer(data, return_tensors="pt")["input_ids"].cuda()
    model.generate(input_ids, generation_config=generation_config, max_new_tokens=128)
end_time = time.time()
API     Model Size    Time (minutes)
HF      7B            12.7
vLLM    7B            3.1
HF      13B           15.8
vLLM    13B           5.3

It seems that the speedup is ~3-4x (not 25x). Am I missing a special setup for vLLM? Thanks.

WoosukKwon commented 1 year ago

Hi @flyman3046, thanks for trying out vLLM! Could you try this

llm.generate(my_dataset, sampling_params)

instead of the for loop? The LLM class internally maintains a queue of input sequences and automatically refills the batch whenever a sequence finishes. This is one of the factors that make vLLM significantly faster than HF. Please try this out!
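
For reference, a minimal sketch of the original timing loop rewritten around this suggestion (my_dataset is the prompt list from the first post; the sampling settings are copied from it):

import time

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.1, top_p=0.75, top_k=40, max_tokens=128, ignore_eos=True)
llm = LLM(model="openlm-research/open_llama_13b")

start_time = time.time()
# Pass the whole list of prompts at once so vLLM can batch and schedule them internally.
outputs = llm.generate(my_dataset, sampling_params)
end_time = time.time()

for output in outputs:
    print(output.outputs[0].text)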

flyman3046 commented 1 year ago

Gave it another try with llm.generate(my_dataset, ...) and it indeed speeds things up quite a lot, from 180 seconds to 16 seconds.

A follow-up question: how much of the speedup is due to batching, and how much is due to other improvements? Is it a fair comparison if HF does not use batching whereas vLLM does? Thanks again!
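
For context, one way to make the HF side batched as well is simple static batching with padding. A hedged sketch (batch_size, the left-padding setup, and do_sample=True are assumptions, not part of the original scripts; my_dataset is the prompt list from the first post):

from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("openlm-research/open_llama_7b").cuda()
tokenizer = LlamaTokenizer.from_pretrained("openlm-research/open_llama_7b")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers have no pad token by default
tokenizer.padding_side = "left"            # left-pad so generation continues from the prompt end

generation_config = GenerationConfig(do_sample=True, temperature=0.1, top_p=0.75, top_k=40)

batch_size = 8  # assumption; tune to fit the GPU
for i in range(0, len(my_dataset), batch_size):
    batch = my_dataset[i:i + batch_size]
    inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")
    model.generate(**inputs, generation_config=generation_config, max_new_tokens=128)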

WoosukKwon commented 1 year ago

@flyman3046 Thanks for sharing your experience! vLLM uses a more sophisticated batching mechanism than traditional static batching. In short, vLLM does not wait until all the sequences in a batch finish, but admits incoming sequences whenever a sequence in the batch finishes. This leads to a 3x-10x throughput improvement in our experience. To implement this on top of HF, you would need to rewrite the model code and develop special CUDA kernels, which is what vLLM did.
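
To make the scheduling idea concrete, here is a toy, purely illustrative simulation (not vLLM's actual scheduler or kernels): with the same set of output lengths, counting decode steps shows why refilling the batch as soon as any sequence finishes beats waiting for the whole batch.

from collections import deque
import random

def continuous_batching_steps(num_requests: int, batch_slots: int, max_len: int = 128) -> int:
    """Count decode steps when finished sequences are replaced immediately."""
    random.seed(0)
    waiting = deque(random.randint(1, max_len) for _ in range(num_requests))
    running = []  # remaining tokens for each sequence currently in the batch
    steps = 0
    while waiting or running:
        # Refill free slots from the waiting queue (the key idea).
        while waiting and len(running) < batch_slots:
            running.append(waiting.popleft())
        # One decode step: every running sequence emits one token; finished ones leave.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps

def static_batching_steps(num_requests: int, batch_slots: int, max_len: int = 128) -> int:
    """Count decode steps when the whole batch must finish before the next one starts."""
    random.seed(0)
    lengths = [random.randint(1, max_len) for _ in range(num_requests)]
    steps = 0
    for i in range(0, num_requests, batch_slots):
        steps += max(lengths[i:i + batch_slots])  # batch runs until its longest sequence ends
    return steps

print(continuous_batching_steps(100, 8), "vs", static_batching_steps(100, 8))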

zhaozhixin commented 1 year ago

@WoosukKwon another question, please help me. In every loop iteration, if I print(llm.generate(data, sampling_params)), I get the answer immediately, so I think this means the generation is not batched across the loop iterations. Is there something wrong with this?