Closed: JH-lee95 closed this issue 6 months ago
TL;DR: you will need to make calls to AsyncLLMEngine via multithreading.
For reference, here's the sglang implementation of parallelism, which is essentially making multiple async calls to the endpoint.
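For illustration, here is a minimal sketch of that endpoint-style approach against a vLLM OpenAI-compatible server, using the standard `openai` Python client rather than anything vLLM-specific. The base URL, model name, and prompts are placeholders; it assumes a server is already running locally.

```python
import asyncio

from openai import AsyncOpenAI

# Assumes a vLLM OpenAI-compatible server is already running, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model <your-model>
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def complete(prompt: str, temperature: float):
    # Each request carries its own sampling parameters.
    return await client.completions.create(
        model="<your-model>",
        prompt=prompt,
        temperature=temperature,
        max_tokens=64,
    )


async def main():
    prompts = ["prompt A", "prompt B", "prompt C"]
    temperatures = [1.0, 0.7, 0.0]
    # Fire all requests concurrently; the server batches them internally.
    results = await asyncio.gather(
        *(complete(p, t) for p, t in zip(prompts, temperatures))
    )
    for result in results:
        print(result.choices[0].text)


asyncio.run(main())
```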
As @ywang96 mentioned, you can use LLMEngine to achieve this right now; see the example at https://github.com/vllm-project/vllm/blob/main/examples/llm_engine_example.py#L34
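Roughly, the pattern in that example looks like the condensed sketch below. The exact `add_request` signature has changed across vLLM versions, so refer to the linked file for your release; the model and prompts are placeholders.

```python
from vllm import EngineArgs, LLMEngine, SamplingParams

engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))

# Each request gets its own SamplingParams.
requests = [
    ("Prompt one", SamplingParams(temperature=1.0)),
    ("Prompt two", SamplingParams(temperature=0.7)),
    ("Prompt three", SamplingParams(temperature=0.0)),
]

for request_id, (prompt, params) in enumerate(requests):
    engine.add_request(str(request_id), prompt, params)

# Step the engine until every request has finished.
while engine.has_unfinished_requests():
    for output in engine.step():
        if output.finished:
            print(output.outputs[0].text)
```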
We currently do not support this in the LLM offline inference wrapper, but it's a good issue to work on!
@ywang96 @simon-mo Thank you! I hope this feature gets integrated soon!
Concretely, this feature request would mean changing the generate function of the LLM class (https://github.com/vllm-project/vllm/blob/03d37f24413b13a4e42ee115f89f647c441d1fcd/vllm/entrypoints/llm.py#L124-L125) to accept sampling_params: Optional[Union[SamplingParams, List[SamplingParams]]].
When sampling_params is None, we should use the defaults. When it is a single value, it should be applied to every prompt. When it is a list, it must have the same length as the prompts and is paired with them one by one.
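A rough sketch of that dispatch logic is below. It is illustrative only, not the actual patch; the helper name `resolve_sampling_params` is made up here.

```python
from typing import List, Optional, Union

from vllm import SamplingParams


def resolve_sampling_params(
    prompts: List[str],
    sampling_params: Optional[Union[SamplingParams, List[SamplingParams]]],
) -> List[SamplingParams]:
    """Return one SamplingParams per prompt, following the rules above."""
    if sampling_params is None:
        # None -> use the defaults for every prompt.
        return [SamplingParams() for _ in prompts]
    if isinstance(sampling_params, SamplingParams):
        # A single value is broadcast to every prompt.
        return [sampling_params] * len(prompts)
    # A list must pair one-to-one with the prompts.
    if len(sampling_params) != len(prompts):
        raise ValueError("sampling_params must have the same length as prompts")
    return list(sampling_params)
```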
I'll be working on this task!
Hi,
sglang supports parallelism (link).
As in the linked example, can I call the API with different sampling parameters in parallel?
For example, if I have batched data, I want to set the temperature to 1.0 for data[0], 0.7 for data[1], and 0.0 for data[2], and then run them simultaneously, as in the sketch below.
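For reference, this is what the desired usage would look like with the list-of-SamplingParams form of generate proposed above; it does not work without that change, and the model name and prompts are placeholders.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

data = ["prompt A", "prompt B", "prompt C"]
# One SamplingParams per prompt, paired by position.
params = [
    SamplingParams(temperature=1.0),
    SamplingParams(temperature=0.7),
    SamplingParams(temperature=0.0),
]

# Proposed API: a list of sampling params paired with the prompts.
outputs = llm.generate(data, params)
for out in outputs:
    print(out.outputs[0].text)
```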