vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

How do I set the batch size for vLLM #1576

Closed. lambda7xx closed this issue 1 year ago

lambda7xx commented 1 year ago

Question

simon-mo commented 1 year ago

For offline inference, you can set the max batch size using max_num_batched_tokens or max_num_seqs. These parameters can be passed to both the Engine and the LLM class. https://github.com/vllm-project/vllm/blob/1a2bbc930135cd3b94fbff2aafbdf5c568acc8bd/vllm/engine/arg_utils.py#L28-L29

https://github.com/vllm-project/vllm/blob/1a2bbc930135cd3b94fbff2aafbdf5c568acc8bd/vllm/engine/arg_utils.py#L151-L159

These are two different parameters. max_num_batched_tokens dictates how many tokens are processed per forward pass; a single sequence can have multiple tokens in flight at the same time (for example, during the prefill stage). max_num_seqs, the number of sequences running concurrently, is probably what you are looking for, since it is a bit higher level.
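
As a minimal sketch (the model name and the specific limits here are illustrative assumptions), both arguments can be passed directly to the LLM class, which forwards them to the engine arguments linked above:

from vllm import LLM

# Both keyword arguments are forwarded to the engine arguments.
llm = LLM(
    model="facebook/opt-125m",    # example model; substitute your own
    max_num_seqs=8,               # at most 8 sequences per scheduler iteration
    max_num_batched_tokens=4096,  # at most 4096 tokens per forward pass
)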

I hope this solves your issue. Feel free to re-open it if this is not resolved!

lambda7xx commented 1 year ago

In the code below, prompts is a list containing 4 prompts. Does that mean the batch size is 4? @simon-mo

from vllm import SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

And it seems that self.max_model_len is the prompt length plus the output token length.
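
To sanity-check that relationship, here is a small sketch (the model name and token counts are illustrative assumptions): the prompt tokens plus the output tokens requested via max_tokens have to fit within max_model_len.

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # OPT-125m has a 2048-token context window

# Request up to 128 generated tokens; the prompt tokens plus these output
# tokens must stay within the model's max_model_len.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
outputs = llm.generate(["The future of AI is"], sampling_params)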

simon-mo commented 1 year ago

Yes, because max_num_seqs defaults to 256, so all 4 prompts fit in a single batch.
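
As a concrete illustration (a sketch reusing the 4 example prompts above; the model name is an assumption), the default max_num_seqs of 256 means all 4 prompts are scheduled together, and you can lower the cap if you want smaller batches:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# With the default max_num_seqs=256, all 4 prompts run in a single batch.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

# To allow at most 2 sequences per scheduler iteration instead:
# llm = LLM(model="facebook/opt-125m", max_num_seqs=2)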