vocodedev / vocode-core

🤖 Build voice-based LLM agents. Modular + open source.
https://vocode.dev

[EPD-458] OpenAI completions stopping #319

Closed cammoore54 closed 1 year ago

cammoore54 commented 1 year ago

After making minimal changes to the chat.py example to tailor it for a golf booking chatbot flow, the OpenAI completions stop consistently.

The behavior below has been reproduced many times. When the conversation reaches this point, the system responds "Thank you for letting me know." but doesn't send the follow-up sentence asking the user another question.

AI: Hello, I'm Tom from the golf course. How may I help you?
Human: hey i want to book comp
Human: DEBUG:__main__:Responding to transcription
AI: Sure, I can help you with that.
AI: Are you a member of our club?
yep
Human: DEBUG:__main__:Responding to transcription
AI: Great!
AI: Could you please provide me with your member number?
12345
Human: DEBUG:__main__:Responding to transcription
AI: Thank you for providing your member number.
AI: May I have your name, please?
cam
Human: DEBUG:__main__:Responding to transcription
AI: Thank you, Cam.
AI: How many players will be participating in the competition?
2
Human: DEBUG:__main__:Responding to transcription
AI: Thank you for letting me know.

Human: DEBUG:__main__:Responding to transcription
ERROR:asyncio:Unclosed connection
client_connection: Connection<ConnectionKey(host='api.openai.com', port=443, is_ssl=True, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=None)>

The response stops; then, after I send an empty message to the system, I receive the above asyncio error and the conversation continues as normal.

EDIT: I should state that the completion stops at "Thank you for letting me know." 100% of the time, but the asyncio error only happens occasionally.

cammoore54 commented 1 year ago

Update: I have tested on both my Intel MacBook Pro and an Ubuntu VM and am experiencing the same behaviour. I believe it could be to do with OpenAI's async completion or with how the stream is being parsed in chat_gpt_agent.py.

rjheeta commented 1 year ago

+1 I'm experiencing this too.

My "workaround" is including specific instruction in the prompt "to always move the conversation forward" / "ask the next question" / etc. but I don't think this is a permanent solution, and it's makes the prompt larger than it needs to be.

Oddly (and if it helps), if I specifically respond with something like, OK? And now what? It will continue as normal.

Where do you suspect this problem is?

cammoore54 commented 1 year ago

When I test with the same prompt in OpenAI's playground, it always follows up with a question. This makes me think it is an error in OpenAI's API or in the way vocode is unpacking the tokens from the OpenAI stream.

[Screenshot: OpenAI playground, 2023-07-31 8:18 am]

tballenger commented 1 year ago

I second this. This happens to us as well. It behaves "correctly" for us when using the ChatGPT playground, always going on to the next question. The workaround for me is to use generate_responses=False (which does NOT use streaming) instead of generate_responses=True (which uses streaming) in the agent config.

Curious to know if that helps? Not sure about a longer term / better solution here...
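
For anyone else trying this, a minimal sketch of that config change (a sketch only; the field names are taken from the vocode quickstart examples and may differ across versions):

from vocode.streaming.models.agent import ChatGPTAgentConfig
from vocode.streaming.models.message import BaseMessage

agent_config = ChatGPTAgentConfig(
    initial_message=BaseMessage(text="Hello, I'm Tom from the golf course. How may I help you?"),
    prompt_preamble="You are Tom, a golf course booking assistant.",
    # False = one full (non-streamed) completion per turn: a little slower,
    # and we don't think it supports the functions that actions use
    generate_responses=False,
)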

Kian1354 commented 1 year ago

are you using 3.5 in the playground as well? it's likely just a prompting issue / model issue

you can test the behavior by putting the exact transcript/prompt used with vocode in the playground (including interruptions, etc.)

i don't think there's an "issue" per se with how we're using the chat api

tballenger commented 1 year ago

@Kian1354 yep, we have been using the exact prompt in vocode and in the ChatGPT playground (using 3.5). Here's a Loom of a side by side: https://www.loom.com/share/27dcfe59f20345658b00d5d4b94ca560 (the chat continues to ask the 4 questions; the voice agent stops after 2 unless I 'remind' it to continue, as mentioned by @cammoore54; you can see the log in the terminal on the right since you can't hear the audio through my headphones).

cammoore54 commented 1 year ago

@Kian1354 After digging a little deeper, I believe it is a bug with OpenAI's Python streaming library. The response from OpenAI sends a stop event at the end of the sentence, indicating that the completion is finished.

@tballenger I experience the same scenario. What's interesting is that the playground streams the response as well, yet the response is different.

Output from openai_get_tokens in vocode/streaming/agent/utils.py:

{
  "id": "chatcmpl-7iWTsDgi96dVoqjvfpxlb9HktUBJj",
  "object": "chat.completion.chunk",
  "created": 1690845644,
  "model": "gpt-3.5-turbo-16k-0613",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": " know"
      },
      "finish_reason": null
    }
  ]
}
 know
{
  "id": "chatcmpl-7iWTsDgi96dVoqjvfpxlb9HktUBJj",
  "object": "chat.completion.chunk",
  "created": 1690845644,
  "model": "gpt-3.5-turbo-16k-0613",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "."
      },
      "finish_reason": null
    }
  ]
}
.
{
  "id": "chatcmpl-7iWTsDgi96dVoqjvfpxlb9HktUBJj",
  "object": "chat.completion.chunk",
  "created": 1690845644,
  "model": "gpt-3.5-turbo-16k-0613",
  "choices": [
    {
      "index": 0,
      "delta": {},
      "finish_reason": "stop"
    }
  ]
}
AI: Thank you for letting me know.

I also tried changing the completion call to the synchronous openai.ChatCompletion.create(**chat_parameters) to rule out the async call, and the issue persists.

I've raised an issue in the openai-python repo here.

tballenger commented 1 year ago

thanks for the update @cammoore54 and nice find! let's hope they address the bug quickly

cammoore54 commented 1 year ago

No worries @tballenger! Can you comment on that ticket to help it gain momentum?

tballenger commented 1 year ago

yep absolutely

rjheeta commented 1 year ago

I've commented on it too!

@tballenger Do you find your workaround to be sufficient in the interim, or are there any known negative implications from unsetting that flag? My initial tests seem to be OK, but I want to make sure I'm not missing anything obvious.

Thanks

tballenger commented 1 year ago

@rjheeta I believe setting that flag bypasses the streaming functionality, so it seems a little slower in our tests; plus, we don't think it supports functions, which are what the actions use. So we're only considering it as a short-term workaround; we hope the streaming Python library gets fixed quickly!

bjquinn commented 1 year ago

+1 I'm getting this too. I'm confident it's not a GPT-3.5 vs. 4 difference, as I'm using 4 with vocode. Before I discovered vocode, I had written my own ASR + agent + TTS application, and it's very verbose in its responses. But when I put the same prompts into vocode, it gets terser and terser as the conversation goes on, finally returning just one-sentence answers. I worried that it was the max_tokens parameter in the OpenAI API call, but I commented that out. That may have made it better, but since the behavior is inconsistent, I'm not 100% sure. I still get lots of one-sentence responses later in the conversation, whereas the exact same prompt on the exact same model, using streaming completions from OpenAI in my own implementation, never devolves into one-sentence responses.

cammoore54 commented 1 year ago

@bjquinn can you share how you process OpenAI streaming in your own code?

cammoore54 commented 1 year ago

After doing some more testing, I have been able to get consistent results with streaming using this code:

import openai

messages = []  # seeded elsewhere with a system prompt for the test

while True:
    message = input("User : ")
    if message:
        messages.append(
            {"role": "user", "content": message},
        )
        response = openai.ChatCompletion.create(
            messages=messages,
            max_tokens=256,
            temperature=1.0,
            model="gpt-3.5-turbo-16k-0613",
            stream=True
        )

        collected_chunks = []
        collected_messages = []
        # iterate through the stream of events
        for chunk in response:
            print(chunk)
            collected_chunks.append(chunk)  # save the raw event
            chunk_message = chunk['choices'][0]['delta']  # extract the message delta
            collected_messages.append(chunk_message)  # save the message

        # join the streamed deltas into the full reply
        full_reply_content = ''.join(m.get('content', '') for m in collected_messages)
        print(f"Full conversation received: {full_reply_content}")
        messages.append(
            {"role": "assistant", "content": full_reply_content},
        )

@Kian1354 this is making me think that it may have to do with vocode's async implementation of unpacking the completion call.

rjheeta commented 1 year ago

@cammoore54 Wow, great investigations!

Do you know where the corresponding response handling in vocode is? You're certainly more versed than I am, but I can try to take a look to see if I spot anything odd.

Is this it? https://github.com/vocodedev/vocode-python/blob/717514c4905d20a1252b94bf693f0badbff0cbd7/vocode/streaming/agent/chat_gpt_agent.py#L110

cammoore54 commented 1 year ago

Starts here: https://github.com/vocodedev/vocode-python/blob/717514c4905d20a1252b94bf693f0badbff0cbd7/vocode/streaming/agent/base_agent.py#L217C11-L217C11

and calls the ChatGPTAgent class's generate_response method https://github.com/vocodedev/vocode-python/blob/717514c4905d20a1252b94bf693f0badbff0cbd7/vocode/streaming/agent/chat_gpt_agent.py#L132

bjquinn commented 1 year ago

@bjquinn can you share how you process OpenAI streaming in your own code?

Yes, see below. Let me know if this is what you were asking for:

# excerpt: this runs inside an async function, with model, messages,
# and functions defined elsewhere in the application
fullcontent = ""
async for chunk in await openai.ChatCompletion.acreate(
        model=model,
        messages=messages,
        stream=True,
        functions=functions
):
    content = chunk["choices"][0].get("delta", {}).get("content")
    # hacky logic to string together sentences and track ends of
    # sentences here; for each sentence, add it to fullcontent

messages.append({"role": "assistant", "content": fullcontent})

That's really it. I do have some hacky sentence-detection logic in the async for, and I kick off async requests to play.ht once I detect a sentence end, but I don't think any of that would affect whether I get full completions from OpenAI or not.
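
For context, a minimal sketch of that style of sentence detection (a hypothetical helper, not my exact code): buffer the streamed deltas and flush whenever a sentence-ending character appears.

SENTENCE_ENDINGS = (".", "!", "?")

def split_sentences(deltas):
    # Naive sentence detection over streamed text deltas: flush the buffer
    # at every '.', '!', or '?'. Note it also splits decimals like '250.35',
    # a quirk that comes up later in this thread.
    buffer = ""
    for delta in deltas:
        buffer += delta
        while True:
            indices = [buffer.index(c) for c in SENTENCE_ENDINGS if c in buffer]
            if not indices:
                break
            end = min(indices)
            sentence = buffer[: end + 1].strip()
            if sentence:
                yield sentence
            buffer = buffer[end + 1:]
    if buffer.strip():
        yield buffer.strip()

# list(split_sentences(["Sure, I can help", ". Are you a member?"]))
# -> ['Sure, I can help.', 'Are you a member?']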

cammoore54 commented 1 year ago

Thanks @bjquinn.

I have tested in isolation with an async implementation using the code below, and I get the desired responses 100% of the time (the same as the playground). Therefore it has to do with the implementation in vocode.

@ajar98 @Kian1354 Do you have the capacity to look into this? I am happy to support but am still familiarising myself with the codebase.

import asyncio
import openai

messages = []  # seeded with the same system prompt used in vocode

async def generate_response(messages):
    async for chunk in await openai.ChatCompletion.acreate(
            model="gpt-3.5-turbo-16k-0613",
            messages=messages,
            stream=True
    ):
        chunk_message = chunk['choices'][0]['delta']
        yield chunk_message

async def handle_convo():
    while True:
        message = input("User : ")
        if message:
            messages.append(
                {"role": "user", "content": message},
            )
            collected_messages = []
            async for item in generate_response(messages):
                print(item)
                collected_messages.append(item)

            full_reply_content = ''.join(m.get('content', '') for m in collected_messages)
            print(f"Full conversation received: {full_reply_content}")
            messages.append(
                {"role": "assistant", "content": full_reply_content},
            )

asyncio.run(handle_convo())

HHousen commented 1 year ago

I have identified the problem. Vocode splits the OpenAI response on sentences in order to synthesize them as fast as possible. After something is spoken, Vocode adds the utterance to the transcript associated with the ChatGPT Agent. As a result, OpenAI's response gets added to the transcript but split apart by sentence. So, when the user sends another message and this transcript is reformatted and sent back to OpenAI to generate the next message, the previous assistant message is split.

For example, when recreating @cammoore54's example with temperature=0 and the gpt-3.5-turbo-16k-0613 model, this is what is sent to the OpenAI API when the user says "yep":

{'role': 'assistant', 'content': "Hello, I'm Tom from the golf course. How may I help you?"},
{'role': 'user', 'content': 'hey i want to book comp'},
{'role': 'assistant', 'content': 'Sure, I can help you with that.'},
{'role': 'assistant', 'content': 'Are you a member of our golf club?'},
{'role': 'user', 'content': 'yep'},

This is what should be sent (what you put into the OpenAI playground):

{'role': 'assistant', 'content': "Hello, I'm Tom from the golf course. How may I help you?"},
{'role': 'user', 'content': 'hey i want to book comp'},
{'role': 'assistant', 'content': 'Sure, I can help you with that. Are you a member of our golf club?'},
{'role': 'user', 'content': 'yep'},

This difference is the source of the problem. If the previous chat history contains only one-sentence responses, then future assistant messages will also be only one sentence.

Good catch finding this bug! The messages should definitely not be split apart when they are sent back to the OpenAI API.
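
A minimal sketch of the shape of the fix (a hypothetical helper, not our actual patch): collapse consecutive same-role messages before the transcript is sent back to the OpenAI API.

def merge_consecutive_messages(messages):
    # Collapse runs of messages from the same role into a single message,
    # so sentence-split assistant utterances go back to OpenAI as the one
    # message they originally were.
    merged = []
    for message in messages:
        if merged and merged[-1]["role"] == message["role"]:
            merged[-1]["content"] += " " + message["content"]
        else:
            merged.append(dict(message))
    return merged

Run over the split transcript above, this yields exactly the properly formatted version.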

Here are the differences from the OpenAI playground (both with temperature=0 and the gpt-3.5-turbo-16k-0613 model):

[Screenshot: messages formatted properly; the second sentence is generated.]

[Screenshot: messages formatted how Vocode currently does it; only one sentence is generated.]

So, it seems like the OpenAI playground and the OpenAI Python library create the exact same response (testing with temperature=0). Also, setting the stream option or using the async vs. sync API doesn't make a difference. The OpenAI Python issue https://github.com/openai/openai-python/issues/555 is probably not an issue after all: depending on how the messages are formatted, the second sentence simply is not generated.

We are currently working on a fix for this! Thanks :smile:!

cammoore54 commented 1 year ago

Ah nice find! Thanks @HHousen. Seems so obvious now 😵‍💫.

Well done vocode team, love your product and your support!

bjquinn commented 1 year ago

@HHousen I tried the patch and it looks to work on my end!!

HHousen commented 1 year ago

@HHousen I tried the patch and it looks to work on my end!!

Nice! We're still working on it and might change how it's implemented in the next few hours, but good to know that it's currently working!

tmancill commented 1 year ago

I have identified the problem. Vocode splits the OpenAI response on sentences in order to synthesize them as fast as possible.

@HHousen Thank you for the explanation here. I believe this also explains an intermittent problem I have been seeing on the synthesis side - namely that sometimes the assistant message passed to the synthesizer is split on the . character that appears in decimal values (like currencies).

This ends up sounding very confusing to the user. The synthesized audio will be something like:

Your balance is two-hundred-fifty. (pause) Thirty-five...

Instead of:

Your balance is two-hundred-fifty and thirty-five.

This is a separate issue that I need to revalidate with the 0.1.111a3 pre-release. (I haven't seen it so far.)

Thanks again!
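
For concreteness, a self-contained illustration of that failure mode, assuming a naive split on every '.' character (the real splitting logic lives in vocode/streaming/agent/utils.py and may behave differently):

import re

text = "Your balance is 250.35. Anything else?"

# naive: treat every '.' as a sentence boundary
naive = [s for s in re.split(r"(?<=\.)\s*", text) if s]
print(naive)   # ['Your balance is 250.', '35.', 'Anything else?']

# decimal-aware: split only on punctuation followed by whitespace and a non-digit
better = re.split(r"(?<=[.!?])\s+(?![0-9])", text)
print(better)  # ['Your balance is 250.35.', 'Anything else?']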

bjquinn commented 1 year ago

I believe this also explains an intermittent problem I have been seeing on the synthesis side - namely that sometimes the assistant message passed to the synthesizer is split on the . character that appears in decimal values (like currencies).

Yes, that's true, though I don't think this fix will solve that. See https://github.com/vocodedev/vocode-python/issues/338 for an issue I submitted that covers other quirks around premature sentence-ending detection. For now, if it's helpful to you: I simply asked GPT in the system prompt to spell out all dollar amounts, and that seems to work well.