matatonic / openedai-vision

An OpenAI API compatible API for chat with image input and questions about the images. aka Multimodal.
GNU Affero General Public License v3.0

The streaming output is not actually streaming #11

Open lucasjinreal opened 1 month ago

lucasjinreal commented 1 month ago

I'm testing with the openai client. The client side actually receives the whole output all at once at the end, even though the server side prints the output as a stream.

Any reason for this?

matatonic commented 1 month ago

Which model/backend are you using? Can you also share your client request code?

lucasjinreal commented 1 month ago

The code looks like this:

response = client.chat.completions.create(
    model="gpt-3.5-turbo-16k",
    messages=[{"role": "user", "content": [
        {'type': 'text', 'text': '描述一下图片内容'},  # "Describe the image content"
        {'type': 'image_url', 'image_url': {'url': encode_image_to_base64('data/images/gt_10036.jpg')}},
    ]}],
    temperature=0.95,
    stream=True,
)
print_response(response)

It's a typical OpenAI request.

I'm using the internvl2-8b model.

The phenomenon is stably reproducible.

matatonic commented 1 month ago

For streamed replies you need to iterate over the response to get the streamed data. Check chat_with_image.py for a more detailed example, but the response should be consumed like this if you want to print it:

for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:  # final chunks may carry an empty delta
        print(chunk.choices[0].delta.content, end='', flush=True)
matatonic commented 1 month ago

See also: https://platform.openai.com/docs/api-reference/streaming

lucasjinreal commented 1 month ago

Oh, I think I misled you. The question is not about my client; my client works well with any other API such as OpenAI's.

And my print_response looks like this:

import openai  # needed for the openai.Stream isinstance check below

def print_response(response):
    if isinstance(response, openai.Stream):
        for chunk in response:
            if (
                hasattr(chunk.choices[0].delta, "content")
                and chunk.choices[0].finish_reason != "stop"
            ):
                print(chunk.choices[0].delta.content, end="", flush=True)
        print()
    else:
        print(response.choices[0].message.content)

I don't know if it is a server issue.

matatonic commented 1 month ago

I've just tested this for sanity's sake, and OpenGVLab/InternVL2-8B streams just fine with version 0.28.1 of openedai-vision. Can you test with chat_with_image.py? It would be:

python chat_with_image.py data/images/gt_10036.jpg "描述一下图片内容"
matatonic commented 1 month ago

I wonder if the problem is the "if isinstance(response, openai.Stream)" check; it's possible the response is not structured correctly and may not be seen as a stream.

Update: no, I just tested, it is type: <class 'openai.Stream'>...

matatonic commented 1 month ago

I don't know what could be wrong so far. Are you using the Docker image? Which version of the openai Python client are you using?

lucasjinreal commented 1 month ago

Thank you for digging into this. The client works fine with OpenAI's API and with one of my vLLM streaming servers, so I'm afraid it is caused by some thread getting stuck on the server.

I have encountered a similar situation before when doing streaming LLM inference; I resolved it when I found I was missing an async in my stream prediction function.
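For illustration, a rough hypothetical sketch of the kind of fix I mean (not this repo's code): the streamed endpoint needs an async generator so the event loop can flush each chunk as it is produced, instead of buffering the whole body.

import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate():
    # Hypothetical token source; in the real case this would wrap model generation.
    for token in ["streamed", " ", "tokens"]:
        yield f"data: {token}\n\n"   # SSE-framed chunk
        await asyncio.sleep(0)       # give the event loop a chance to flush it

@app.get("/stream")
async def stream():
    return StreamingResponse(generate(), media_type="text/event-stream")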

But after I tested and verified the code against yours, it looks flawless.

The phenomenon on my side is still the same as described above: the client receives the whole output at once at the end.

So I still suspect it is caused by a FastAPI or SSE asynchronization issue on the server side, not the client and not the model generation.

matatonic commented 1 month ago

Ugh, threading in Python is a nightmare... You could try the newer AsyncClient from openai, with httpx? I don't know much about this, but it seems preferred.

See a sample here: https://github.com/openai/openai-python/blob/195c05a64d39c87b2dfdf1eca2d339597f1fce03/examples/streaming.py#L33
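Roughly like this, a minimal untested sketch; the base_url, api_key and prompt here are just placeholders for your local setup:

import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint/key/model for a local openedai-vision server; adjust to your setup.
client = AsyncOpenAI(base_url="http://localhost:5006/v1", api_key="skip")

async def main():
    stream = await client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": "Describe the image"}],
        stream=True,
    )
    # Chunks should arrive (and print) incrementally if streaming works end to end.
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

asyncio.run(main())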

matatonic commented 1 month ago

Did you confirm the code is receiving data as a stream and just printing all at once?

matatonic commented 1 month ago

I would be interested to see the .to_json() dumps of the two different streams, so I can compare. I have built everything from reverse engineering, so I don't even test against the OpenAI API directly, just the openai client. I have not ruled out a server-side issue.
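Something like this rough sketch would do it; it assumes the same client and request setup as in your snippet above:

import time

def dump_stream(response):
    # Print each chunk's arrival time and full JSON payload so the two servers'
    # streams can be compared chunk by chunk (timing shows whether data really
    # arrives incrementally or all at once).
    start = time.monotonic()
    for chunk in response:
        print(f"[{time.monotonic() - start:7.3f}s] {chunk.to_json()}")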

lucasjinreal commented 1 month ago

Did you confirm the code is receiving data as a stream and just printing all at once?

Let me test again in detail to answer this.

matatonic commented 1 month ago

Any update or should I close this report?

lucasjinreal commented 1 month ago

Hello, please wait for my debug information. Currently I am using non-streaming mode directly, but the streaming functionality should ultimately be tested.

matatonic commented 2 weeks ago

Hi, any update? Otherwise I think this is getting closed.

lucasjinreal commented 2 weeks ago

Hi, I have been busy these days. The root cause has still not been precisely investigated.

matatonic commented 1 day ago

Hi, I'm coming back around to this again - any chance you've been able to debug it?