twilio-samples / speech-assistant-openai-realtime-api-node

MIT License

AI does not stop speaking after VAD detects interrupt #9

Open HarounAns opened 1 week ago

HarounAns commented 1 week ago

I’ve encountered an issue where the AI continues speaking even after the server's Voice Activity Detection (VAD) detects an interrupt. Ideally, when an interrupt is detected, the AI should stop talking immediately. However, in my case, the AI keeps talking for a bit longer before eventually moving to the next response at random.

I'm unsure if this is happening because the code isn't terminating the server response when an interrupt occurs, or if it's a side effect of the AI still being in beta and not fully refined.

Looking for suggestions or ideas on how to resolve this and get the AI to respond more consistently to interruptions.

badereddineqodia commented 1 week ago

+1

brainvine commented 1 week ago

Struggling to fix this as well. It seems the example is missing some logic to handle the detection. It should send a response.cancel when an interrupt is detected. I don't think this is because of the beta; it works pretty well in the OpenAI Playground.

HarounAns commented 1 week ago

Hmm, my understanding is that you don't need to send response.cancel if you're in server_vad mode; it should just handle it. I wonder if the ws client still holds previous audio buffers from OpenAI and sends them to the server without clearing them.

badereddineqodia commented 1 week ago

@brainvine But when we start talking, the speech_started event is triggered, which means the server knows that we are talking...

Surebob commented 1 week ago

+1

kjjd84 commented 1 week ago

+1

itamargero commented 1 week ago

+1

brainvine commented 1 week ago

@brainvine But when we start talking, the speech_started event is triggered, which means the server knows that we are talking...

Yeah, but it never sends the interrupted event.

alanzou commented 6 days ago

+1

da1z commented 6 days ago

You need to clear Twilio's audio buffer by sending a clear event to the Twilio WebSocket.

brainvine commented 6 days ago

You need to clear Twilio's audio buffer by sending a clear event to the Twilio WebSocket.

Thanks! Can you give us an example on where / how to implement this?

I've added this:

if (response.type === 'input_audio_buffer.speech_started') {
    console.log('User started speaking');
    if (currentResponseId) {
        console.log('Interrupting assistant');
        const cancelMessage = {
            type: 'response.cancel',
            id: currentResponseId
        };
        openAiWs.send(JSON.stringify(cancelMessage));
        currentResponseId = null;

        // Clear Twilio buffer
        const clearMessage = {
            event: 'clear',
            streamSid: streamSid,
        };
        connection.send(JSON.stringify(clearMessage));
    }
}

I don't think this works. Tested this but I'm having the same issue where I can't really interrupt the assistant.

Edit: It seems to work! Just need to put this at the top. Will test some more to see if it's robust.


if (response.type === 'input_audio_buffer.speech_started') {
    console.log('User started speaking');
    const clearMessage = {
        event: 'clear',
        streamSid: streamSid,
    };
    connection.send(JSON.stringify(clearMessage));
....

da1z commented 5 days ago

@brainvine Yes, exactly. OpenAI generates and sends audio chunks faster than they are played to you. So if the AI's response is 'Hello, how are you doing?', Twilio's buffer would contain the full message even though you've only heard 'Hello' so far. That's why you need to clear the buffer when there's an interruption.
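To make the size of that gap concrete, here is a rough sketch that estimates how much audio has been sent to Twilio but not yet played. It assumes the stream uses 8 kHz mu-law audio (one byte per sample, so 8 bytes per millisecond); the function names are hypothetical and not from the sample repo.

```javascript
// Assumption: 8 kHz G.711 mu-law audio, i.e. 8 bytes of audio per millisecond.
const BYTES_PER_MS = 8;

// Duration in ms of one base64-encoded audio payload.
function audioMs(base64Payload) {
  return Buffer.from(base64Payload, 'base64').length / BYTES_PER_MS;
}

// Milliseconds of audio already sent but not yet played to the caller.
function bufferedAheadMs(payloads, elapsedPlaybackMs) {
  const sentMs = payloads.reduce((total, p) => total + audioMs(p), 0);
  return Math.max(0, sentMs - elapsedPlaybackMs);
}

// Example: 2 seconds of audio sent, but only 500 ms played so far.
const oneSecond = Buffer.alloc(8000).toString('base64');
console.log(bufferedAheadMs([oneSecond, oneSecond], 500)); // 1500
```

Whatever this returns at the moment of interruption is roughly what the caller would still hear if the buffer were not cleared.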

pkamp3 commented 4 days ago

As an update, we're looking at an enhancement for this. In the meantime, if it works for your use case (read to the end for the side effect), sending a clear event to Twilio will work.

Server VAD handles interrupt detection/pre-emption, so when you receive input_audio_buffer.speech_started (or speech_stopped) you can send { "event": "clear", "streamSid": "..."} like @brainvine is showing. You shouldn't need to send response.cancel, but let me know if that isn't the case.
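A minimal sketch of that handler, pulled out into a standalone function so it can be tried without live sockets. The name handleOpenAiEvent and the sendToTwilio callback are hypothetical; in the sample app the callback would wrap connection.send.

```javascript
// Sketch: forward a Twilio "clear" event whenever server VAD reports user speech.
// `sendToTwilio` is a hypothetical callback, e.g. (msg) => connection.send(msg).
function handleOpenAiEvent(event, streamSid, sendToTwilio) {
  if (event.type !== 'input_audio_buffer.speech_started') return false;
  if (!streamSid) return false; // no active Twilio stream yet, nothing to clear
  sendToTwilio(JSON.stringify({ event: 'clear', streamSid }));
  return true;
}

// Usage with an array standing in for the Twilio socket.
// 'MZ123' is a placeholder stream SID, not a real one.
const sent = [];
handleOpenAiEvent({ type: 'input_audio_buffer.speech_started' }, 'MZ123', (m) => sent.push(m));
console.log(sent[0]); // {"event":"clear","streamSid":"MZ123"}
```

No response.cancel is sent here, on the assumption stated above that server VAD pre-empts the response on its own.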

Side effect: the 'whole' conversation, even the interrupted and cleared part, will remain in the conversation as @da1z pointed out. A full solution needs logic with conversation.item.truncate or conversation.item.delete.
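For illustration, a truncate event could be built roughly like this. The helper name buildTruncateEvent is hypothetical, and the sketch assumes you already know the assistant item's ID (for example from a response.audio.delta event) and how many milliseconds of it the caller actually heard.

```javascript
// Sketch: build a conversation.item.truncate event for the Realtime API.
// `itemId` and `heardMs` are assumed to be tracked elsewhere by the caller.
function buildTruncateEvent(itemId, heardMs) {
  return {
    type: 'conversation.item.truncate',
    item_id: itemId,
    content_index: 0, // assumes the audio is the item's first content part
    audio_end_ms: Math.max(0, Math.floor(heardMs)), // whole, non-negative ms
  };
}

// Usage: 'item_abc' is a placeholder ID.
const truncateEvent = buildTruncateEvent('item_abc', 1234.7);
console.log(JSON.stringify(truncateEvent));
// In the app this would be sent with openAiWs.send(JSON.stringify(truncateEvent)).
```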

pkamp3 commented 3 days ago

https://github.com/twilio-samples/speech-assistant-openai-realtime-api-node/tree/ai_talks_first_server_vad_preemption if anyone wants to look at our initial run at it.

Here are the known issues as of today:

pkamp3 commented 1 day ago

https://github.com/twilio-samples/speech-assistant-openai-realtime-api-node/pull/15 we've been testing this version today if anyone wants to take a look.

It adds AI interruption. It estimates the elapsed time the user heard the AI response to truncate the conversation properly. It should work to within a few words (based on the chunking of response.audio.delta, the speech_started event, and Twilio media events and mark messages). Note: you might also want to use speech_stopped in your implementation, but this should demonstrate one way to estimate how much the user heard.
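One way to picture the elapsed-time estimate is the sketch below. It is not the code from the PR: the PlaybackTracker class and its method names are hypothetical, mark handling is left out, and it assumes the timestamp on each Twilio media message (milliseconds since the stream started) is a good enough clock.

```javascript
// Hypothetical tracker: estimates how long the caller has been hearing the
// current AI response, using Twilio media timestamps as the clock.
class PlaybackTracker {
  constructor() {
    this.latestMediaTimestamp = 0;      // last timestamp seen on a Twilio media message
    this.responseStartTimestamp = null; // clock value when the response's first delta arrived
  }

  // Call for every Twilio 'media' event with data.media.timestamp.
  onTwilioMedia(timestampMs) {
    this.latestMediaTimestamp = Number(timestampMs);
  }

  // Call for every response.audio.delta; only the first one per response matters.
  onAudioDelta() {
    if (this.responseStartTimestamp === null) {
      this.responseStartTimestamp = this.latestMediaTimestamp;
    }
  }

  // Call on input_audio_buffer.speech_started. Returns the estimated ms heard,
  // or null if no response was playing, and resets for the next response.
  onSpeechStarted() {
    if (this.responseStartTimestamp === null) return null;
    const elapsed = this.latestMediaTimestamp - this.responseStartTimestamp;
    this.responseStartTimestamp = null;
    return elapsed;
  }
}

// Usage: response starts at 1000 ms into the call, user interrupts at 1840 ms.
const tracker = new PlaybackTracker();
tracker.onTwilioMedia('1000');
tracker.onAudioDelta();
tracker.onTwilioMedia('1840');
console.log(tracker.onSpeechStarted()); // 840
```

The returned value could then be used as audio_end_ms in a conversation.item.truncate event. Because it is only as fine-grained as the media and delta chunks, the estimate is approximate, consistent with the "within a few words" caveat above.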

It also adds the "AI speaks first" feature.

Please let us know if it works for you. If it looks good on our end, we'll merge soon.