twilio-labs / call-gpt

Generative AI phone call toolkit using Twilio Media Streams.
MIT License
267 stars 110 forks

Improve AI Voice Agent Response Time by Utilizing WebSocket for Streaming Audio #48

Open shakir-snakescript opened 2 months ago

shakir-snakescript commented 2 months ago

I have observed that the current implementation of the AI voice agent, which uses OpenAI, Deepgram, and Twilio, experiences a delay of 4-5 seconds before responding when a call begins. This is despite using the stream = true feature. It appears that the response is delayed until the stream is completed.

In the current implementation, there is a loop that buffers the audio:

    while(Object.prototype.hasOwnProperty.call(this.audioBuffer, this.expectedAudioIndex)) {
      const bufferedAudio = this.audioBuffer[this.expectedAudioIndex];
      this.sendAudio(bufferedAudio);
      this.expectedAudioIndex++;
    }
    } else {
      this.audioBuffer[index] = audio;
    }

I believe this delay can be reduced by using the WebSocket within the while loop to stream the audio in chunks, rather than waiting for the entire stream to complete.

By implementing WebSocket for chunk-by-chunk streaming, the AI voice agent can respond more promptly, significantly enhancing the user experience.
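To make the behaviour of the quoted loop easier to discuss, here is a minimal, self-contained sketch of in-order buffering. It is an illustration, not the project's code: the class name `OrderedAudioBuffer` and the `send` callback are hypothetical, the `index === this.expectedAudioIndex` branch is an assumption inferred from the dangling `else` in the snippet, and the `delete` is an addition that is not in the quoted code.

```javascript
// Hypothetical stand-in for the object that owns audioBuffer and expectedAudioIndex.
class OrderedAudioBuffer {
  constructor(send) {
    this.send = send;            // assumed callback, e.g. one that writes to the socket
    this.expectedAudioIndex = 0; // index of the next chunk allowed to go out
    this.audioBuffer = {};       // chunks that arrived ahead of their turn
  }

  buffer(index, audio) {
    if (index === this.expectedAudioIndex) {
      // The chunk is next in line: send it immediately.
      this.send(audio);
      this.expectedAudioIndex++;

      // Flush any later chunks that were waiting on this one.
      while (Object.prototype.hasOwnProperty.call(this.audioBuffer, this.expectedAudioIndex)) {
        const bufferedAudio = this.audioBuffer[this.expectedAudioIndex];
        delete this.audioBuffer[this.expectedAudioIndex]; // not in the quoted snippet
        this.send(bufferedAudio);
        this.expectedAudioIndex++;
      }
    } else {
      // The chunk arrived early: hold it until the gap before it is filled.
      this.audioBuffer[index] = audio;
    }
  }
}

// Chunks 1 and 2 arrive before chunk 0.
const sent = [];
const ordered = new OrderedAudioBuffer((audio) => sent.push(audio));
ordered.buffer(1, 'chunk-1'); // held
ordered.buffer(2, 'chunk-2'); // held
ordered.buffer(0, 'chunk-0'); // sent, then chunk-1 and chunk-2 are flushed
console.log(sent);
```

Under these assumptions, a chunk goes out as soon as every earlier chunk has gone out. Later chunks are held only while an earlier one is still missing, so it may be worth measuring how long the first chunk takes to arrive before changing the loop.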

Please let me know whether this makes sense, or whether there is a reason the stream is handled this way.

akashkaushik33 commented 1 month ago

Don't you think this will reduce the delay only marginally? The only difference is that we would send the data in chunks rather than all at once. The reduction will vary with internet speed, but it will surely help.

What I was thinking is that we should optimize the delay between the moment the user stops speaking and the moment the audio is sent. Background noise above a certain level makes this worse: the listener registers the noise as a foreground event and waits for it to subside, so the delay before the audio is sent increases significantly.
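The background-noise effect described above can be illustrated with a small, self-contained sketch. It is not how this project or Deepgram detects the end of speech. The function `findEndOfSpeech`, the per-frame energy values, and the thresholds are all hypothetical, chosen only to show why a fixed threshold can stall in a noisy room while a threshold set relative to a measured noise floor does not.

```javascript
// Returns the index of the frame at which end of speech is declared, or -1 if never.
// energies: per-frame loudness values (hypothetical units)
// threshold: frames above this value count as speech
// framesNeeded: consecutive quiet frames required after speech
function findEndOfSpeech(energies, threshold, framesNeeded) {
  let heardSpeech = false;
  let quietFrames = 0;
  for (let i = 0; i < energies.length; i++) {
    if (energies[i] > threshold) {
      heardSpeech = true;
      quietFrames = 0; // any loud frame restarts the wait
    } else if (heardSpeech) {
      quietFrames++;
      if (quietFrames >= framesNeeded) return i;
    }
  }
  return -1;
}

// Made-up data: steady background noise, 20 frames of speech, then noise again.
const noise = 0.05;
const calibration = Array(10).fill(noise); // assumed to be captured before anyone speaks
const frames = [...Array(20).fill(0.6), ...Array(15).fill(noise)];

// Fixed threshold below the noise level: every frame looks like speech.
const fixedResult = findEndOfSpeech(frames, 0.02, 10);

// Threshold set to three times the measured noise floor.
const noiseFloor = calibration.reduce((sum, e) => sum + e, 0) / calibration.length;
const adaptiveResult = findEndOfSpeech(frames, noiseFloor * 3, 10);

console.log(fixedResult, adaptiveResult);
```

With this data the fixed threshold never declares the end of speech, while the noise-relative threshold does so ten quiet frames after the last speech frame. A real deployment would more likely tune the speech-to-text provider's own endpointing settings than reimplement this by hand.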