vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Batch inference for `llm.chat()` API #8481

Closed · ywang96 closed this issue 1 month ago

ywang96 commented 1 month ago

🚀 The feature, motivation and pitch

Currently, the `llm.chat()` API only accepts one conversation per call, which means it cannot fully leverage vLLM's batched throughput for efficient offline processing.
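For reference, a minimal sketch of the current single-conversation usage and the proposed batched call (the model name and prompts are illustrative, and the batched form is the feature being requested here, not existing behavior):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative model
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# Today: one conversation per llm.chat() call.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
]
outputs = llm.chat(conversation, sampling_params)

# Proposed: a list of conversations in a single call, so the engine can batch them.
conversations = [
    [{"role": "user", "content": "Translate 'good morning' to French."}],
    [{"role": "user", "content": "What is the capital of Peru?"}],
]
outputs = llm.chat(conversations, sampling_params)  # not supported at the time of this issue
```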

Alternatives

No response

Additional context

Implementation should be rather straightforward:

  1. At the API level, `llm.chat()` should also accept a list of conversations.
  2. When `llm.chat()` is invoked with a list, each conversation is parsed into a prompt, and all multimodal data items are retrieved and loaded into the format that `llm.generate()` accepts.
  3. The resulting list of `{"prompt": ..., "multi_modal_data": ...}` dicts is sent to `llm.generate()` (see the sketch after this list).
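A minimal sketch of steps 2 and 3, assuming Hugging Face-style chat templating via the model's tokenizer; `render_prompt`, `extract_multi_modal_data`, and `batched_chat` are hypothetical names for illustration, not vLLM internals:

```python
from typing import Any


def render_prompt(tokenizer, conversation: list[dict[str, Any]]) -> str:
    """Apply the model's chat template to a single conversation."""
    return tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )


def extract_multi_modal_data(conversation: list[dict[str, Any]]) -> dict[str, Any] | None:
    """Placeholder: a real implementation would fetch and decode any image/audio
    items referenced by the messages into the objects llm.generate() expects."""
    return None


def batched_chat(llm, conversations, sampling_params):
    tokenizer = llm.get_tokenizer()
    prompts = []
    for conversation in conversations:
        prompt: dict[str, Any] = {"prompt": render_prompt(tokenizer, conversation)}
        mm_data = extract_multi_modal_data(conversation)
        if mm_data:
            prompt["multi_modal_data"] = mm_data
        prompts.append(prompt)
    # llm.generate() already accepts a list of prompt dicts, so the batching
    # itself comes for free once the conversations are rendered.
    return llm.generate(prompts, sampling_params)
```

Since `llm.generate()` already batches over a list of prompts, the bulk of the work is in rendering each conversation and resolving its multimodal inputs before the single `generate()` call.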


aandyw commented 1 month ago

I can work on this :)