vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Support multiple sampling params in `LLM` API #3313

Closed · JH-lee95 closed 6 months ago

JH-lee95 commented 7 months ago

Hi,

sglang supports parallelism (link).

As in the example at that link, can I call the API with different sampling parameters in parallel?

For example, if I have batched data:

data=["hi","hello","i'm your assistant"]

I want to set the temperature to 1.0 for data[0], 0.7 for data[1], and 0.0 for data[2], and then run them simultaneously.
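
For concreteness, the usage this request implies might look like the sketch below. This is hypothetical: at the time of this issue, `LLM.generate` accepts only a single `SamplingParams`, not a list.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # model name is only for illustration

data = ["hi", "hello", "i'm your assistant"]

# One SamplingParams per prompt -- this per-prompt form is the requested
# feature, not something LLM.generate supports today.
params = [
    SamplingParams(temperature=1.0),
    SamplingParams(temperature=0.7),
    SamplingParams(temperature=0.0),
]

outputs = llm.generate(data, params)
```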

ywang96 commented 7 months ago

TL;DR: you will need to make concurrent calls to AsyncLLMEngine, e.g. via multithreading.

For reference, here's the sglang implementation of parallelism, which is essentially making multiple async calls to the endpoint.
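
One way to get the same effect today is to issue concurrent requests against a running OpenAI-compatible vLLM server, each with its own sampling parameters. A minimal sketch, assuming the server was started with `python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m` and listens on localhost:8000:

```python
import asyncio

from openai import AsyncOpenAI

# Assumes a vLLM OpenAI-compatible server is already running locally.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompts_and_temps = [("hi", 1.0), ("hello", 0.7), ("i'm your assistant", 0.0)]


async def complete(prompt: str, temperature: float):
    # Each request carries its own sampling parameters.
    return await client.completions.create(
        model="facebook/opt-125m",
        prompt=prompt,
        temperature=temperature,
        max_tokens=64,
    )


async def main():
    # The requests run concurrently; the server batches them internally.
    results = await asyncio.gather(
        *(complete(p, t) for p, t in prompts_and_temps))
    for (prompt, _), result in zip(prompts_and_temps, results):
        print(prompt, "->", result.choices[0].text)


asyncio.run(main())
```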

simon-mo commented 7 months ago

As @ywang96 mentioned, you can use LLMEngine to achieve this right now; see the example at https://github.com/vllm-project/vllm/blob/main/examples/llm_engine_example.py#L34

We currently do not support this in the LLM offline inference wrapper, but it's a good issue to work on!
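
For reference, the pattern in the linked llm_engine_example.py is roughly the following: each prompt is added as its own request with its own `SamplingParams`, and the engine is stepped until all requests finish. This is a sketch; exact argument names may differ slightly between versions.

```python
from vllm import EngineArgs, LLMEngine, SamplingParams

engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))

test_prompts = [
    ("hi", SamplingParams(temperature=1.0)),
    ("hello", SamplingParams(temperature=0.7)),
    ("i'm your assistant", SamplingParams(temperature=0.0)),
]

# Queue every prompt as a separate request with its own sampling parameters.
for request_id, (prompt, params) in enumerate(test_prompts):
    engine.add_request(str(request_id), prompt, params)

# Step the engine until all queued requests have finished.
while engine.has_unfinished_requests():
    for output in engine.step():
        if output.finished:
            print(output.request_id, "->", output.outputs[0].text)
```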

JH-lee95 commented 7 months ago

@ywang96 @simon-mo Thank you! I hope this feature gets integrated soon!

simon-mo commented 7 months ago

Concretely, this feature request would mean changing the generate function in the LLM class (https://github.com/vllm-project/vllm/blob/03d37f24413b13a4e42ee115f89f647c441d1fcd/vllm/entrypoints/llm.py#L124-L125) to support `sampling_params: Optional[Union[SamplingParams, List[SamplingParams]]]`.

When sampling_params is None, we should use the default. When it is a single value, it should be applied to every prompt. When it is a list, it must have the same length as the prompts and is paired with them one to one.
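
A sketch of the pairing logic this implies; the helper name is illustrative, not the actual patch:

```python
from typing import List, Optional, Union

from vllm import SamplingParams


def resolve_sampling_params(
    prompts: List[str],
    sampling_params: Optional[Union[SamplingParams, List[SamplingParams]]],
) -> List[SamplingParams]:
    """Illustrative helper: expand sampling_params to one entry per prompt."""
    if sampling_params is None:
        # Use default sampling parameters for every prompt.
        return [SamplingParams() for _ in prompts]
    if isinstance(sampling_params, SamplingParams):
        # A single value applies to every prompt.
        return [sampling_params] * len(prompts)
    if len(sampling_params) != len(prompts):
        raise ValueError(
            "The lengths of prompts and sampling_params must be the same.")
    # Already one SamplingParams per prompt, paired positionally.
    return list(sampling_params)
```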

nunjunj commented 7 months ago

I'll be working on this task!