timoklimmer / powerproxy-aoai

Monitors and processes traffic to and from Azure OpenAI endpoints.
MIT License

feature idea: client stream request -> proxy switch to batch -> AOAI -> proxy switch to stream response -> client #73

Closed krahnikblis closed 2 months ago

krahnikblis commented 2 months ago

i'm going to start figuring this out, unless y'all are already working on such a feature and i can leave it to the experts? the idea is to avoid some of the streaming-latency issues we have by switching to one-shot requests at the proxy server.

our instance of AOAI breaks up streams into batches and runs them through the content filter (which i have no control over) - each small batch can take up to 2 minutes, so a 500-token streaming round-trip can take up to 20 minutes, whereas batch mode goes through the filter only once and thus usually takes no more than a minute or two for a whole request.

easy answer: "just don't use stream" - but unfortunately the various VS Code plugins we're trying don't allow that configuration; they're hard-coded to use streaming mode. so, i'd like a configuration (ideally at the client level) to hijack stream requests and send them as one-shots. but i suspect the client expects a streaming response ('text/event-stream' or whatever), so the proxy would need to feed everything back in that format or else the client would error when it receives a single json? maybe this last conversion back to stream isn't necessary, i haven't started messing with it yet.
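roughly what i'm picturing on the request side - just a sketch with made-up names (the endpoint, helper, and header handling are all assumptions, not actual powerproxy-aoai code):

```python
# sketch only: flip a client's streaming request into a one-shot AOAI call
# (hypothetical helper, not actual powerproxy-aoai code)
import httpx

AOAI_ENDPOINT = "https://my-aoai.openai.azure.com"  # assumption: your AOAI endpoint


async def forward_as_non_streaming(path: str, body: dict, headers: dict) -> dict:
    """Strip `stream: true` from the client's request and make a one-shot call instead."""
    client_wanted_stream = bool(body.get("stream", False))
    aoai_body = {**body, "stream": False}  # force non-streaming towards AOAI
    async with httpx.AsyncClient(base_url=AOAI_ENDPOINT, timeout=None) as client:
        response = await client.post(path, json=aoai_body, headers=headers)
        response.raise_for_status()
        completion = response.json()
    # the caller decides whether to return this as plain json
    # or to re-emit it in the streaming format the client asked for
    return {"wanted_stream": client_wanted_stream, "completion": completion}
```

the open question is the second half: if the client asked for a stream, the proxy would presumably have to re-emit that json in the streaming format.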

at any rate, i'd love a feature like this, or your thoughts on how to go about contributing to it. thanks!

timoklimmer commented 2 months ago

Hi @krahnikblis, if I understand you correctly, you want to save the content filter's latency overhead by converting streaming requests into non-streaming requests. While this would probably be possible from a technical perspective, I think it would give only limited benefit because the time-to-first-token would increase significantly, so the user experience would not really improve. I suggest taking a look at the documentation here. There are several options to consider, like reconfiguring the content filter, enabling asynchronous filtering, or even opting out of content filtering.

timoklimmer commented 2 months ago

Closing issue

krahnikblis commented 1 month ago

@timoklimmer i understand the prioritization of such a request and how it might not align with your priorities for the project, given the difficulty of building it.

in response to "I think it would give only limited benefits because the time-to-first-token would significantly increase and with that, user experience should not really improve": it's a significant difference - we've seen up to 20 minutes for a streaming response containing only ~500 tokens, whereas a non-streaming request for the same response size has never taken more than 2 minutes or so (i.e., it's a big enough issue that i'm trying to solve or work around it). if you're interested in more info on how others are running into this behavior of Azure, see this related issue, especially the comment from March 14th demonstrating the delays from chunking the stream through the content filter: https://github.com/vercel/ai/issues/1066

it's probably more an "Azure OAI is still in its infancy and working out the kinks" kind of thing which will resolve itself at some point anyway... thanks for taking a look, and for this cool project!

timoklimmer commented 1 month ago

Hi @krahnikblis, thanks for coming back on this and sharing the link to the other thread. Yes, if the content filter overhead is reduced, the total latency goes down, but I was thinking of the time-to-first-token. If the whole response needs to be generated before any token can be returned (which I think is the case in your proposed approach), the overall latency may well go down, but users will also have to wait longer until they see a first reaction to their prompt. This effect might not be much of an issue at 500 tokens, but it becomes one the more tokens are generated (for example, whole documents). Whether your approach works probably depends on the use case.

Unfortunately, I do indeed need to prioritize. Morphing a non-streaming response back into a streaming response is not trivial. It could also be that the streaming response from AOAI includes information that is not reproducible from the non-streaming response, so you would lose that, which essentially means you cannot fully keep the protocol. In addition, you would need to re-chunk/re-tokenize the response, etc. If you want to give it a try, feel free to do so, but my time to support this is limited.
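To illustrate what that morphing would involve, here is a rough sketch of re-emitting a complete chat completion as SSE chunks. The chunk shapes are assumptions based on the public chat completions streaming format, not code from this repo, and fields like content filter annotations would not be reproducible:

```python
# sketch: replay a complete (non-streaming) chat completion as SSE chunks
# that mimic the "chat.completion.chunk" streaming format
# (assumed shapes; some fields of a real stream cannot be reconstructed)
import json
from typing import Iterator


def fake_stream(completion: dict, chunk_size: int = 20) -> Iterator[str]:
    """Yield SSE lines; the proxy would also need to answer with
    Content-Type: text/event-stream for the client to accept it."""
    choice = completion["choices"][0]
    content = choice["message"].get("content") or ""
    base = {
        "id": completion["id"],
        "object": "chat.completion.chunk",
        "created": completion["created"],
        "model": completion["model"],
    }

    def chunk(delta: dict, finish_reason=None) -> str:
        event = {**base, "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}]}
        return "data: " + json.dumps(event) + "\n\n"

    # first chunk carries the role, the following chunks carry content slices
    yield chunk({"role": "assistant"})
    for i in range(0, len(content), chunk_size):
        yield chunk({"content": content[i:i + chunk_size]})
    yield chunk({}, finish_reason=choice.get("finish_reason", "stop"))
    yield "data: [DONE]\n\n"
```

Note that this re-slices the text at arbitrary character boundaries rather than the token boundaries AOAI uses, which is another way such a reconstructed stream would differ from the original protocol.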

If possible/affordable, I would first consider using Provisioned Throughput Units (PTUs), due to their faster and more consistent latencies, and then reconfiguring/using async filtering or opting out of the content filter. Besides, unless it was a giant prompt, 20 minutes for a streaming response sounds unusually long - I'm not sure that would be the usual response time.