vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Online Inference on local model with OpenAI Python SDK #8631

Open pesc101 opened 1 month ago

pesc101 commented 1 month ago

🚀 The feature, motivation and pitch

OpenAI recently introduced a new Batch API endpoint (https://platform.openai.com/docs/guides/batch/overview?lang=curl). It would be nice if the same batch format worked against a local model served by vLLM. I created a usage issue for this before (https://github.com/vllm-project/vllm/issues/8567).

Something like this:

from openai import OpenAI

# Point the OpenAI client at the local vLLM server.
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# Upload the batch input file.
batch_input_file = client.files.create(
    file=open("batchinput.jsonl", "rb"),
    purpose="batch",
)

# Create the batch job against the chat completions endpoint.
client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={
        "description": "nightly eval job"
    },
)

At the moment this fails with an error: NotFoundError: Error code: 404 - {'detail': 'Not Found'}
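For reference, batchinput.jsonl above would use OpenAI's documented batch input format: one JSON request per line with a custom_id, method, url, and body. A minimal sketch of producing such a file (the model name and prompts are placeholders):

import json

# Each line is one request in OpenAI's batch input format.
requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 128,
        },
    }
    for i, prompt in enumerate(["Hello!", "What is vLLM?"])
]

with open("batchinput.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")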

Advantages for the implementation:

Alternatives

Internal Implementation: There is already a feature implemented as python -m vllm.entrypoints.openai.run_batch, as described here (https://github.com/vllm-project/vllm/issues/4777), but it is neither compatible with the OpenAI SDK nor with the Docker setup.
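For context, that existing entrypoint runs as a one-shot offline job over local files rather than against a running server, which is why it cannot be driven via client.files.create() / client.batches.create(). A rough sketch of how it is invoked (flag names taken from the vLLM docs and may differ by version; the model name is a placeholder):

import subprocess

# Run vLLM's offline batch runner on an OpenAI-format batch file.
subprocess.run(
    [
        "python", "-m", "vllm.entrypoints.openai.run_batch",
        "-i", "batchinput.jsonl",   # requests in OpenAI's batch format
        "-o", "batchoutput.jsonl",  # one result per line
        "--model", "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder
    ],
    check=True,
)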

Additional context

No response


pesc101 commented 1 month ago

Seems to me there are some bots with suspect links here 👀

DarkLight1337 commented 1 month ago

cc @wuisawesome @pooyadavoodi since you two have worked on the batch API

wuisawesome commented 1 month ago

"but that is not compatible with the OpenAI SDK and also not compatible with the docker setup"

Out of curiosity, can you say more about your Docker setup? Would it unblock you to mount the directory with your data into your Docker container?

Fwiw, the reason I didn't implement this API originally was that I couldn't think of a way to implement the job management without either introducing foot-guns or adding the server's first stateful endpoint.

This is not the first time we've heard this request though, and it is probably worth thinking more about if it becomes a recurring theme.
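To make the "stateful endpoint" concern concrete, here is a minimal, purely illustrative sketch of what in-memory /v1/files and /v1/batches handlers could look like (FastAPI-style; the job store and all names are hypothetical, not vLLM's actual server code). The dictionaries holding uploaded files and batch jobs are exactly the state that would need persistence, eviction, and multi-replica handling in a real implementation:

import uuid

from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()

# Hypothetical in-memory state: the crux of the "stateful endpoint" concern.
_files: dict[str, bytes] = {}
_batches: dict[str, dict] = {}

@app.post("/v1/files")
async def upload_file(file: UploadFile = File(...), purpose: str = Form("batch")):
    file_id = f"file-{uuid.uuid4().hex}"
    _files[file_id] = await file.read()
    return {"id": file_id, "object": "file", "purpose": purpose}

@app.post("/v1/batches")
async def create_batch(request: dict):
    batch_id = f"batch-{uuid.uuid4().hex}"
    _batches[batch_id] = {
        "id": batch_id,
        "object": "batch",
        "status": "in_progress",
        "input_file_id": request["input_file_id"],
        "endpoint": request.get("endpoint", "/v1/chat/completions"),
    }
    # A background task would parse the JSONL, run the requests through the
    # engine, store an output file, and flip the status to "completed".
    return _batches[batch_id]

@app.get("/v1/batches/{batch_id}")
async def retrieve_batch(batch_id: str):
    return _batches[batch_id]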

pesc101 commented 1 month ago

Hey, you can find my Docker setup here: https://github.com/vllm-project/vllm/issues/8567. I have mounted the directory into the Docker container.

Okay, I see. I don't know exactly how to implement it, but I think it would improve the usability of vLLM in general. A general advantage of vLLM is fast batch inference, and I think it would be nice if that were accessible through the OpenAI SDK.

mbuet2ner commented 1 month ago

Totally agree that this is an interesting feature. It would be super nice to have something standardized here (I mentioned something similar here). We are currently leveraging Ray and the LLM class's llm.chat() for that, really similar to the very simple generate() example from the docs. v0.6.2 even brought support for batch inference in llm.chat().

Our current approach is as follows:

I can share some WIP code on how to parse the JSONL files with Ray, load them as BatchRequestInput, and format the llm.chat() output as BatchRequestOutput, if you want. It is a little complicated due to the different interfaces, but it works!
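Not mbuet2ner's actual WIP code, but to illustrate the shape of that workaround, here is a minimal sketch that feeds an OpenAI-style batch JSONL through vLLM's LLM.chat() (assuming vLLM >= 0.6.2, where llm.chat() accepts a list of conversations; the Ray parallelism and the BatchRequestInput/BatchRequestOutput conversion are left out, and the model name is a placeholder):

import json

from vllm import LLM, SamplingParams

# Load the OpenAI-style batch requests and keep the custom_ids for the output.
with open("batchinput.jsonl") as f:
    requests = [json.loads(line) for line in f if line.strip()]

conversations = [req["body"]["messages"] for req in requests]
custom_ids = [req["custom_id"] for req in requests]

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder
params = SamplingParams(max_tokens=128)

# One call batches all conversations through the engine.
outputs = llm.chat(conversations, params)

# Write results line by line, loosely mirroring the batch output format.
with open("batchoutput.jsonl", "w") as f:
    for custom_id, output in zip(custom_ids, outputs):
        f.write(json.dumps({
            "custom_id": custom_id,
            "response": output.outputs[0].text,
        }) + "\n")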