For offline inference, you can set the max batch size using max_num_batched_tokens or max_num_seqs. These parameters can be passed to both the Engine and the LLM class.
https://github.com/vllm-project/vllm/blob/1a2bbc930135cd3b94fbff2aafbdf5c568acc8bd/vllm/engine/arg_utils.py#L28-L29
These are two different parameters. max_num_batched_tokens dictates how many tokens are processed per forward pass; a single sequence can contribute multiple tokens in the same step (for example during the prefill stage). The number of sequences (max_num_seqs) is probably what you are looking for, which is a bit higher level.
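For concreteness, here is a minimal sketch of passing both caps via the LLM class; the model name and numbers below are illustrative, not taken from this thread:

```python
from vllm import LLM

# Illustrative values (not defaults); both kwargs are engine arguments, so
# they can also be passed when constructing the engine directly.
llm = LLM(
    model="facebook/opt-125m",        # example model
    max_num_seqs=8,                   # at most 8 sequences scheduled per iteration
    max_num_batched_tokens=4096,      # at most 4096 tokens per forward pass
)
```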
I hope this solves the issue. Feel free to re-open it if this is not resolved!
In the code below, prompts is a list containing 4 prompts; does that mean the batch size is 4? @simon-mo
```python
from vllm import SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Sampling parameters shared by all four prompts.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
```
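As a hedged sketch of what happens next with that snippet (the model name is illustrative), the whole list is handed to the engine in one call, and the scheduler decides how many sequences actually run per iteration, bounded by max_num_seqs and max_num_batched_tokens:

```python
from vllm import LLM

# Illustrative model; reuses `prompts` and `sampling_params` defined above.
llm = LLM(model="facebook/opt-125m")

# All four prompts are submitted together; how many run in each forward pass
# is up to the scheduler, capped by max_num_seqs / max_num_batched_tokens.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```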
And it seems that self.max_model_len is equal to the prompt length plus the output token length.
Yes, because the default max sequence length is 256.
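For illustration, a small sketch of how those two limits relate; the concrete numbers are examples chosen for this snippet, not claimed library defaults:

```python
from vllm import LLM, SamplingParams

# Example values only: the context window (max_model_len) has to hold the
# prompt tokens plus whatever gets generated.
llm = LLM(model="facebook/opt-125m", max_model_len=512)

# max_tokens caps the generated part; 256 mirrors the number discussed above.
sampling_params = SamplingParams(max_tokens=256)

# Per request: prompt tokens + generated tokens <= max_model_len.
outputs = llm.generate(["The capital of France is"], sampling_params)
print(outputs[0].outputs[0].text)
```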
Question