katanemo / arch

Arch is an intelligent prompt gateway. Engineered with (fast) LLMs for the secure handling, robust observability, and seamless integration of prompts with APIs - all outside business logic. Built by the core contributors of Envoy proxy, on Envoy.
https://archgw.com
Apache License 2.0

Add support for streaming API #36

Open adilhafeez opened 2 months ago

adilhafeez commented 2 months ago

By default, for all requests made to LLMs (whether API-based or open-source), the entire completion response is generated before anything is sent back to the client. This creates a bad user experience.

Enter the streaming API. With streaming, the client receives updates as the model generates output tokens. This greatly improves the user experience, but adds significant load on the network.
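For reference, this is roughly what the default (non-streaming) flow looks like with the OpenAI Python SDK; the call blocks until the whole completion is ready, and the model name here is just an example:

from openai import OpenAI

client = OpenAI()

# Without stream=True, the SDK waits for the entire completion to be
# generated, and the full response body arrives in one piece.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say this is a test"}],
)
print(response.choices[0].message.content)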

junr03 commented 1 month ago

I believe there is a misunderstanding somewhere here.

Let me see if I understand the problem statement correctly: is the problem that a client using the gateway will only receive bytes from the upstream LLM API once the LLM API has fully sent an HTTP response?

If that is the problem statement, then that is not the case. As long as Envoy has prior knowledge that the LLM API is capable of handling HTTP/2, Envoy will stream response bytes back to the client -- that is, as long as none of the installed filters holds response bytes until completion, which as of right now the gateway filter does not.

On the other side: the request path is currently not streaming bytes up to the LLM API, given that, AFAIK, the local routing heuristics need a full request body before making a routing decision. Is that not the case?

adilhafeez commented 1 month ago

is the problem that a client using the gateway will only receive bytes from the upstream LLM API once the LLM API has fully sent an HTTP response

Yes

However, it's not an HTTP/1 vs. HTTP/2 issue; it's how OpenAI chose to implement streaming. LLMs generate text one token at a time (conditioning on all of the tokens generated so far). Without streaming support, OpenAI (and all the other SDKs and models) wait for all of the text to be generated before sending the response back to the client.

With streaming enabled, the LLM sends a response chunk for every single token.

For example, take a look at this code (from this link):

from openai import OpenAI

client = OpenAI()

# stream=True asks the API to return the completion incrementally as
# server-sent events instead of one final response body.
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
# Each chunk carries a delta with the newly generated token(s).
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
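On the wire, each of those chunks arrives as a server-sent event; abridged, with illustrative field values, the stream looks roughly like this:

data: {"object":"chat.completion.chunk","model":"gpt-4o-mini","choices":[{"index":0,"delta":{"content":"This"},"finish_reason":null}]}

data: {"object":"chat.completion.chunk","model":"gpt-4o-mini","choices":[{"index":0,"delta":{"content":" is"},"finish_reason":null}]}

data: [DONE]

The SDK parses those events and surfaces them as the chunk objects in the loop above.
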
junr03 commented 1 month ago

Alright, this makes more sense now. It looks like: a) The request has to authorize the server to stream, and b) The client has to be able to parse server events in order to correctly stream response tokens.

In that case, my next question to you is: should the gateway handle the parsing of the streaming chunks, or should the gateway stream the HTTP response bytes to the Katanemo client and let the client parse the streamed tokens using a well-maintained library?

adilhafeez commented 1 month ago

a) The request has to authorize the server to stream

To start a streaming response, the developer has to configure the client with stream=true. When streaming is set, Envoy should expect the LLM to send a response chunk for every token it generates, and forward each chunk back to the developer. This continues until the LLM has generated all of its tokens or has hit a rate limit.
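Concretely, the request the gateway sees is an ordinary chat-completions POST with stream set in the JSON body; a rough, abridged sketch:

POST /v1/chat/completions
{
  "model": "gpt-4o-mini",
  "messages": [{"role": "user", "content": "Say this is a test"}],
  "stream": true
}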

b) The client has to be able to parse server events in order to correctly stream response tokens.

Yes, the developer will know what kind of response they are receiving, since they would've configured the client with stream=true.

In that case, my next question to you is: should the gateway handle the parsing of the streaming chunks, or should the gateway stream the HTTP response bytes to the Katanemo client and let the client parse the streamed tokens using a well-maintained library?

I think the simplest implementation would be to stream the response from the LLM back to the developer as-is; the client would understand and parse the response.
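To illustrate what that client-side parsing amounts to if the gateway just forwards the bytes, here is a rough sketch (a hypothetical helper, not part of the gateway or any SDK):

import json

def iter_deltas(sse_lines):
    """Yield content deltas from OpenAI-style server-sent event lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta is not None:
            yield delta

# Example with two illustrative chunks:
sample = [
    'data: {"choices":[{"index":0,"delta":{"content":"This"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" is"},"finish_reason":null}]}',
    'data: [DONE]',
]
print("".join(iter_deltas(sample)))  # -> "This is"

In practice the OpenAI SDK (as in the snippet above) already handles this parsing, which is why passing the bytes through unmodified keeps the gateway simple.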

adilhafeez commented 1 week ago

Needs verification