pesc101 opened this issue 1 month ago
Seems to me there are some bots with suspect links 👀
cc @wuisawesome @pooyadavoodi since you two have worked on batch API
> but that is not compatible with the OpenAI SDK and also not compatible with the docker setup
Ooc can you say more about your docker setup? Would it unblock you to mount the directory with data into your docker container?
Fwiw, the reason I didn't implement this API originally was that I couldn't think of a way to implement the job management without either introducing foot-guns or the first stateful endpoint.
This is not the first time we've heard this request though, and it is probably worth thinking more about if it becomes a recurring theme.
Hey, you can find my docker setup here: https://github.com/vllm-project/vllm/issues/8567. I have mounted the directory into the Docker container.
Okay, I see. I don't know exactly how to implement it, but I think it would improve the usability of vLLM in general. A major advantage of vLLM is fast batch inference, and I think it would be nice if it were compatible with the OpenAI SDK.
Totally agree that this is an interesting feature. Would be super nice to have something standardized here (mentioned something similar here).

We are currently leveraging Ray and the `LLM` class's `llm.chat()` for that. Really similar to the very simple `generate()` example from the docs. v0.6.2 even brought support for batch inference for `llm.chat()`.

Our current approach is as follows:

1. An endpoint accepts the `batch.jsonl` files and writes them to a blob storage. You can find the official OpenAI API specification here and can use the OpenAI Pydantic models from the SDK (which are auto-generated from the API spec) to build your endpoints and validate the data.
2. The requests then go through the `LLM` class. `llm.chat()` has a slightly different interface, but the `messages` format and the sampling parameters are more or less identical to the OpenAI format. You can take the Pydantic models from the existing batch entrypoint, parse the JSONL files, extract the `messages`, sampling parameters etc., and give them to `llm.chat()`.
3. After that you can take the `RequestOutput` from `llm.chat()` and iteratively build the `BatchRequestOutput` and the intermediate OpenAI/vLLM-adapted OpenAI Pydantic models.

I can share some WIP code on how to parse the JSONL files with Ray, load them as `BatchRequestInput`, and format the `llm.chat()` output as `BatchRequestOutput` if you want. It is a little bit complicated due to the different interfaces, but it works!
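A minimal sketch of the pipeline described above, using a plain `json`-based parse rather than the commenter's Ray-based WIP code (the model name, the single shared `SamplingParams`, and the exact output field mapping are assumptions):

```python
# Rough sketch: parse OpenAI-batch-format JSONL, run it through llm.chat(),
# and write OpenAI-batch-style output lines. Model name and field mapping
# are assumptions, not taken from the thread.
import json

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any local chat model

# 1. Read the OpenAI batch input file (one JSON request object per line).
with open("batch.jsonl") as f:
    requests = [json.loads(line) for line in f if line.strip()]

# 2. Extract the OpenAI-style messages; the sampling parameters use largely
#    the same names as the OpenAI request body, so they map over directly.
conversations = [req["body"]["messages"] for req in requests]
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# 3. Since v0.6.2, llm.chat() accepts a list of conversations and batches them.
outputs = llm.chat(conversations, sampling_params)

# 4. Rebuild OpenAI-batch-style output lines from the RequestOutput objects
#    (vLLM's BatchRequestOutput Pydantic model could be used instead of a dict).
with open("batch_output.jsonl", "w") as f:
    for req, out in zip(requests, outputs):
        line = {
            "custom_id": req.get("custom_id"),
            "response": {
                "status_code": 200,
                "body": {
                    "object": "chat.completion",
                    "model": req["body"].get("model"),
                    "choices": [
                        {
                            "index": 0,
                            "message": {
                                "role": "assistant",
                                "content": out.outputs[0].text,
                            },
                            "finish_reason": out.outputs[0].finish_reason,
                        }
                    ],
                },
            },
            "error": None,
        }
        f.write(json.dumps(line) + "\n")
```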
🚀 The feature, motivation and pitch
OpenAI recently introduced a new batch inference endpoint (https://platform.openai.com/docs/guides/batch/overview?lang=curl). It would be nice if the OpenAI batch format also worked with a local model served by vLLM. I created a usage issue for that before (https://github.com/vllm-project/vllm/issues/8567).
Something like this:
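A minimal sketch of the intended usage, assuming the official OpenAI SDK is pointed at a local vLLM OpenAI-compatible server (base URL, API key, file name, and the model inside `batch.jsonl` are placeholders, not from the original issue):

```python
# Minimal sketch: drive a local vLLM OpenAI-compatible server with the
# official OpenAI SDK batch API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Upload the OpenAI-batch-format input file (one request per line).
batch_input_file = client.files.create(
    file=open("batch.jsonl", "rb"),
    purpose="batch",
)

# Create the batch job against the chat completions endpoint.
batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Later: poll the job and fetch the results once it has completed.
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    print(client.files.content(batch.output_file_id).text)
```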
At the moment, this fails with an error:
`NotFoundError: Error code: 404 - {'detail': 'Not Found'}`
Advantages for the implementation:
Alternatives
Internal Implementation: There was a feature implemented using `python -m vllm.entrypoints.openai.run_batch`, as described here (https://github.com/vllm-project/vllm/issues/4777), but that is not compatible with the OpenAI SDK and also not compatible with the Docker setup.

Additional context
No response