vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Batch inference for `llm.chat()` API #8481

Closed · ywang96 closed this issue 1 month ago

ywang96 commented 1 month ago

🚀 The feature, motivation and pitch

Currently, the `llm.chat()` API only accepts one conversation per call, which means it cannot fully leverage vLLM's batched throughput for efficient offline processing.
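For reference, a minimal sketch of the current single-conversation usage and the proposed batched call (the model name and prompts are illustrative, and the batched form is the feature being requested here, not existing behavior):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative model
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# Today: one conversation per llm.chat() call.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
]
outputs = llm.chat(conversation, sampling_params)

# Proposed: a list of conversations in a single call, so the engine can batch them.
conversations = [
    [{"role": "user", "content": "Translate 'good morning' to French."}],
    [{"role": "user", "content": "What is the capital of Peru?"}],
]
outputs = llm.chat(conversations, sampling_params)  # not supported at the time of this issue
```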

Alternatives

No response

Additional context

Implementation should be rather straightforward:

  1. At the API level, `llm.chat()` should also accept a list of conversations.
  2. When `llm.chat()` is invoked with a list, each conversation is parsed into a prompt, and all multimodal data items are retrieved and loaded into the format that `llm.generate()` accepts.
  3. The resulting list of `{"prompt": ..., "multi_modal_data": ...}` dicts is sent to `llm.generate()` (see the sketch after this list).
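A minimal sketch of steps 2 and 3, assuming Hugging Face-style chat templating via the model's tokenizer; `render_prompt`, `extract_multi_modal_data`, and `batched_chat` are hypothetical names for illustration, not vLLM internals:

```python
from typing import Any


def render_prompt(tokenizer, conversation: list[dict[str, Any]]) -> str:
    """Apply the model's chat template to a single conversation."""
    return tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )


def extract_multi_modal_data(conversation: list[dict[str, Any]]) -> dict[str, Any] | None:
    """Placeholder: a real implementation would fetch and decode any image/audio
    items referenced by the messages into the objects llm.generate() expects."""
    return None


def batched_chat(llm, conversations, sampling_params):
    tokenizer = llm.get_tokenizer()
    prompts = []
    for conversation in conversations:
        prompt: dict[str, Any] = {"prompt": render_prompt(tokenizer, conversation)}
        mm_data = extract_multi_modal_data(conversation)
        if mm_data:
            prompt["multi_modal_data"] = mm_data
        prompts.append(prompt)
    # llm.generate() already accepts a list of prompt dicts, so the batching
    # itself comes for free once the conversations are rendered.
    return llm.generate(prompts, sampling_params)
```

Since `llm.generate()` already batches over a list of prompts, the bulk of the work is in rendering each conversation and resolving its multimodal inputs before the single `generate()` call.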


aandyw commented 1 month ago

I can work on this :)