vocodedev / vocode-core

🤖 Build voice-based LLM agents. Modular + open source.
https://vocode.dev
MIT License

incorrect text encoding for swedish output running ChatGPTAgentConfig with generate_responses=True #39

Closed lefant closed 1 year ago

lefant commented 1 year ago

I noticed an encoding issue: when I set "generate_responses=True" in ChatGPTAgentConfig, the text coming from ChatGPT appears to be decoded incorrectly.

DEBUG:alfrid_python.vocode_examples.telephony_app:Human started speaking
DEBUG:alfrid_python.vocode_examples.telephony_app:Got transcription:  Hur många planeter finns I ett vårt provsystem?, confidence: 0.8232422
DEBUG:alfrid_python.vocode_examples.telephony_app:Generating response for transcription
DEBUG:alfrid_python.vocode_examples.telephony_app:Sent chunk 0 with size 8000

...

DEBUG:alfrid_python.vocode_examples.telephony_app:Sent chunk 9 with size 8000
DEBUG:alfrid_python.vocode_examples.telephony_app:Sent chunk 10 with size 1635
DEBUG:alfrid_python.vocode_examples.telephony_app:Message sent: Det finns åtta planeter som ingår i vårt solsystem: Merkurius, Venus, jorden, Mars, Jupiter, Saturnus, Uranus och Neptunus.

"Det finns Ã¥tta planeter som ingÃ¥r i vÃ¥rt" should be "Det finns åtta planeter som ingår i vårt solsystem".

If I run with generate_responses=False, the text comes out in the second, correct form.

I am running a self-hosted telephony server with the Deepgram transcriber (configured for Swedish), the ChatGPT agent, and the Azure synthesizer.

lefant commented 1 year ago

The identical issue is present when running the streaming_conversation.py and hosted_streaming_conversation.py examples.

ajar98 commented 1 year ago

Thanks @lefant! This is due to our implementation of consuming server-sent events from the ChatGPT API: https://github.com/vocodedev/vocode-python/blob/d2f5c60840399fcf5e551bd4c33f4722fbf34446/vocode/streaming/agent/chat_gpt_agent.py#L122

yannrouillard commented 1 year ago

I had the same issue with French. It is indeed caused by the SSEClient, which tries to guess the encoding from the Content-Type response header. Since the Content-Type is text/event-stream, requests seems to default to ISO-8859-1 (probably due to this: https://github.com/psf/requests/blob/51716c4ef390136b0d4b800ec7665dd5503e64fc/requests/utils.py#L555), so the UTF-8 bytes are decoded with the wrong codec and non-ASCII characters come out garbled.
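To illustrate the mechanism, here is a minimal sketch using only built-in string methods (it does not touch requests or vocode). It assumes the server sends UTF-8 bytes and the client decodes them as ISO-8859-1, which is the behaviour described above; the variable names are just for illustration.

```python
# UTF-8 bytes as they would arrive on the wire for part of the Swedish reply.
raw = "Det finns åtta planeter som ingår i vårt solsystem".encode("utf-8")

# Decoding with the wrong codec reproduces the garbled text from the report:
# each two-byte UTF-8 sequence is read as two separate Latin-1 characters.
garbled = raw.decode("iso-8859-1")

# Decoding with UTF-8 gives the intended text.
correct = raw.decode("utf-8")

# The damage is reversible as long as no bytes were lost along the way.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(garbled)
print(correct)
```

Round-tripping the garbled string is only a diagnostic trick; the real fix is to decode with the right codec in the first place.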

According to the spec (https://html.spec.whatwg.org/multipage/server-sent-events.html#parsing-an-event-stream), it seems an event stream should always be encoded as UTF-8. I think this would be better fixed in requests (I opened an issue here: https://github.com/psf/requests/issues/6427). In the meantime, I patched vocode/streaming/utils/sse_client.py at line 96 with:

        if self.resp.headers.get("content-type") == "text/event-stream":
            encoding = "utf-8"
        else:
            encoding = self.resp.encoding or self.resp.apparent_encoding

which fixes the issue.

I suppose we could hardcode the encoding here directly if SSEClient only ever processes text/event-stream data, but I am not expert enough on that to be sure.
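One caveat with the exact string comparison in the patch: it would not match if a server sent the header with parameters or different casing (for example "text/event-stream; charset=utf-8"). Below is a rough sketch of a more tolerant check. The function name pick_sse_encoding and its fallback parameter are hypothetical and not part of vocode or requests, and it assumes the headers mapping is keyed by lowercase "content-type" (or is case-insensitive, as the requests headers object is).

```python
from typing import Mapping, Optional


def pick_sse_encoding(headers: Mapping[str, str], fallback: Optional[str] = None) -> str:
    """Hypothetical helper: choose the codec used to decode a response body."""
    content_type = headers.get("content-type", "")
    # Keep only the media type: drop parameters such as "; charset=..."
    # and normalise case and surrounding whitespace.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/event-stream":
        # The SSE spec says event streams are decoded as UTF-8.
        return "utf-8"
    # Otherwise use whatever the caller detected, defaulting to UTF-8.
    return fallback or "utf-8"


print(pick_sse_encoding({"content-type": "text/event-stream; charset=utf-8"}))
```

In the patched client, the fallback would be the existing `self.resp.encoding or self.resp.apparent_encoding` expression.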

ajar98 commented 1 year ago

@yannrouillard nice!! want to open a PR?

lefant commented 1 year ago

thanks @yannrouillard for figuring this out! I tested the change and it resolves my issue.

yannrouillard commented 1 year ago

> @yannrouillard nice!! want to open a PR?

I would have done so gladly, but I am too late to the party! :-) Thanks @lefant and @ajar98