twilio-samples / speech-assistant-openai-realtime-api-node

MIT License

Audio input quality seems bad #13

Closed: brainvine closed this issue 1 day ago

brainvine commented 1 week ago

I've noticed that the audio input quality is quite poor, which results in the AI struggling to understand what's being said. When using the transcribe function from OpenAI, I observe numerous transcription errors. These issues rarely occur in the OpenAI Playground.

The error rate is so high that I don't think I can put this live, especially since real users often won't speak as clearly as I do in my testing environment.

Possible Cause:

I suspect this might be related to the audio input format (g711_ulaw). The Playground uses PCM16, which seems to offer much better quality. Still, I'd hope the difference in transcription quality wouldn't be this significant regardless of the format used.

Request:

Could it be possible to review or adjust the audio input format to PCM16, or investigate why there's such a significant difference in transcription quality between the current implementation and the Playground?

Is there a way to increase input quality somehow?
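For context, the sample keeps the whole pipeline in µ-law: it tells the Realtime API what to expect via a `session.update` event. A minimal sketch of that configuration (field names per the Realtime API; the variable name and the commented send call are illustrative, not copied from this repo):

```javascript
// Sketch of the session.update event the server sends to the
// Realtime API over its WebSocket. Twilio Media Streams delivers
// 8 kHz G.711 u-law, so the session is configured to match.
const sessionUpdate = {
  type: 'session.update',
  session: {
    input_audio_format: 'g711_ulaw',   // what Twilio sends us
    output_audio_format: 'g711_ulaw',  // what Twilio can play back
    // Using 'pcm16' here instead would also require transcoding and
    // resampling the 8 kHz phone audio before forwarding it.
  },
};

// In the real app this is sent once the OpenAI WebSocket opens,
// e.g.: openAiWs.send(JSON.stringify(sessionUpdate));
```

Switching `input_audio_format` alone won't help: the source audio is still 8 kHz narrowband phone audio, so the format change doesn't add back any fidelity the channel already discarded.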

badereddineqodia commented 1 week ago

I totally agree with you @brainvine. I've noticed the same issue, especially when dealing with local dialects. While it can somewhat handle English (not well), it struggles significantly with dialects like Darija or other regional languages. The difference in transcription accuracy between the current implementation and the OpenAI Playground is very noticeable.

badereddineqodia commented 1 week ago

@pkamp3 Do you have any ideas? Can we work with PCM16 rather than G.711 u-law?

brainvine commented 1 week ago

I've been thinking about it, and I don't believe the codec is the main issue (although any improvements might enhance accuracy). The overall audio quality seems to be the real problem. Since it's coming from a phone and involves compression, the quality is inherently worse than direct audio input from the device, regardless of any adjustments.

This is something OpenAI should address. They might need to integrate more advanced listening models into their Realtime API backend, trained on actual phone calls. I've also noticed that the Whisper-1 speech-to-text engine used for transcription isn't the same model the Realtime API uses for the actual audio processing. Sometimes the transcribed speech in the logs is way off, even though the model itself seems to understand me quite well.
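That split is visible in the session config: input transcription is a separate, opt-in Whisper pass in the Realtime API, distinct from the model that actually answers the caller. A sketch, assuming the documented `input_audio_transcription` session field (the variable name is illustrative):

```javascript
// Sketch: the transcript you see in the logs comes from an
// optional whisper-1 pass configured on the session, not from
// the realtime model that "hears" and responds to the caller.
// That is why the logged transcript and the model's apparent
// understanding can disagree.
const sessionUpdate = {
  type: 'session.update',
  session: {
    input_audio_format: 'g711_ulaw',
    input_audio_transcription: { model: 'whisper-1' },
  },
};
```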

kjjd84 commented 5 days ago

Same issue here.

Also, the AI voice is crackling a little bit now.

alfiesal commented 3 days ago

Does this work for anyone? I can't achieve the same results as in the tutorial. Somehow the audio quality is poor, so the agent doesn't understand what I'm saying. When I use a direct integration with the Realtime API, everything works perfectly.

badereddineqodia commented 1 day ago

I don't fully understand what the issue is, as the problem doesn't seem clearly defined. Is the problem related to Twilio or OpenAI? I used Twilio with Azure's Speech-to-Text, and it worked quite well. However, when using Twilio with OpenAI, the results seem poor.

I suspect there may be an issue with the audio encoding between Twilio and OpenAI. Twilio may not support OpenAI's default audio encoding format, or OpenAI may need to add broader support for more formats. Currently, we have to perform a transformation in between: we receive the audio data from Twilio, transform it, and then send it to OpenAI.
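For anyone experimenting with such a transformation: G.711 µ-law expands to 16-bit PCM with a small, table-free formula. The sketch below is the standard ITU-T G.711 expansion, not code from this repo, and you would still need to resample from 8 kHz before feeding a `pcm16` session:

```javascript
// Decode one G.711 u-law byte to a signed 16-bit PCM sample
// (standard ITU-T G.711 expansion; illustrative, not from this repo).
function ulawToPcm16(ulawByte) {
  const u = ~ulawByte & 0xff;        // u-law bytes are stored inverted
  const sign = u & 0x80;             // bit 7: sign
  const exponent = (u >> 4) & 0x07;  // 3-bit segment number
  const mantissa = u & 0x0f;         // 4-bit step within the segment
  // Expand: add the 0x84 bias, shift by the segment, remove the bias.
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode a whole Twilio media payload (u-law bytes) to PCM16 samples.
function decodeUlawBuffer(ulawBytes) {
  const pcm = new Int16Array(ulawBytes.length);
  for (let i = 0; i < ulawBytes.length; i++) {
    pcm[i] = ulawToPcm16(ulawBytes[i]);
  }
  return pcm;
}
```

Note this only changes the container, not the fidelity: the result is still 8 kHz narrowband audio, so transcoding alone won't recover Playground-level quality.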

pkamp3 commented 1 day ago

Hey all, I hear your concerns but we're limited by the channel and this is out of scope for this repo. G.711 A-law or u-law are what we need to use for calls on the PSTN, and that's what Media Streams supports.

Let me point you at a few ideas, though: