
Enable streaming option in the OpenAI API server #480

adk9 commented 1 month ago

Now that token streaming support has been merged (#397), we can enable streaming responses in the OpenAI-compatible RESTful API endpoint.

This PR enables the streaming option in the OpenAI API server entrypoint. Example usage:

Running the Server

python -m mii.entrypoints.openai_api_server \
    --model "mistralai/Mistral-7B-Instruct-v0.1" \
    --port 3000 \
    --host 0.0.0.0
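
Once the server is up, a quick way to confirm it is reachable is to list the served models. This is a minimal sketch that assumes the server exposes the standard OpenAI /v1/models route; the host and port match the launch command above.

from openai import OpenAI

# Assumes the server implements the standard OpenAI /v1/models route.
# Host and port match the launch command above; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="test")

for model in client.models.list():
    print(model.id)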

Client

from openai import OpenAI

# Point the client at the MII server started above (port 3000).
# The API key is a placeholder.
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="test",
)

# Request a streamed chat completion; the model name must match the
# one the server was launched with.
completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[
        {
            "role": "user",
            "content": "Tell me a joke.",
        },
    ],
    max_tokens=1024,
    stream=True,
)

# Print each token as it arrives; chunks without content
# (e.g. the role preamble) are skipped.
for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
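
For reference, the same stream can be consumed at the wire level without the OpenAI client. This is a minimal sketch assuming the server follows OpenAI's server-sent-events format ("data: {...}" lines terminated by "data: [DONE]"); URL, model, and prompt match the example above.

import json
import requests

# Open a streaming POST to the chat completions endpoint.
response = requests.post(
    "http://localhost:3000/v1/chat/completions",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.1",
        "messages": [{"role": "user", "content": "Tell me a joke."}],
        "max_tokens": 1024,
        "stream": True,
    },
    stream=True,
)

for line in response.iter_lines():
    if not line:
        continue
    # Each event is a "data: <json>" line; "[DONE]" marks the end of the stream.
    payload = line.decode("utf-8").removeprefix("data: ")
    if payload == "[DONE]":
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="")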