server VAD doesn't seem to pick up voice stopping

robcontreras commented 3 weeks ago

I can connect successfully, and an event is received when I start speaking, but then it hangs: it never detects that I have stopped and just waits until the timeout is reached. Am I missing something?

Session updated successfully: {'type': 'session.updated', 'event_id': 'event_AF8eSFQtCeDPjLip7kbgJ', 'session': {'id': 'sess_AF8eRPbIlNNeKrx6ghnHj', 'object': 'realtime.session', 'model': 'gpt-4o-realtime-preview-2024-10-01', 'expires_at': 1728172439, 'modalities': ['text', 'audio'], 'instructions': 'You are a helpful and bubbly AI assistant who loves to chat about anything the user is interested in and is prepared to offer them facts. You have a penchant for dad jokes, owl jokes, and rickrolling – subtly. Always stay positive, but work in a joke when appropriate.', 'voice': 'alloy', 'turn_detection': {'type': 'server_vad', 'threshold': 0.5, 'prefix_padding_ms': 300, 'silence_duration_ms': 500}, 'input_audio_format': 'pcm16', 'output_audio_format': 'g711_ulaw', 'input_audio_transcription': None, 'tool_choice': 'auto', 'temperature': 0.8, 'max_response_output_tokens': 'inf', 'tools': []}}

Received event: input_audio_buffer.speech_started {'type': 'input_audio_buffer.speech_started', 'event_id': 'event_AF8eZJem71dtYCftbFvOP', 'audio_start_ms': 512, 'item_id': 'item_AF8eZKJuxUJmFcjQzCoPN'}

jhmaddox commented 2 weeks ago

I spent a bit of time looking at this and got interruption working quite well on my branch. Here are a few things:

OpenAI's server VAD actually works quite well. Watch for the input_audio_buffer.speech_started event.
OpenAI sends the entire generated audio upfront, in much less time than it takes to play the audio. Watch for the response.audio.delta event.
The code in this project, as it is today, receives the entire audio from OpenAI and immediately sends it to Twilio where it is queued for play. Therefore, there isn't a way to "cancel" the playback during an interruption because the audio has already been sent to Twilio
The approach I settled on was to manage the audio queue myself: When audio is received from OpenAI, it is added to an audio buffer. The data in the buffer is sent "just in time" to Twilio. If an interruption occurs, then the buffer is cleared and the response truncated.
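For anyone who wants to try the buffering idea described above, here is a minimal sketch of it. It is not the code from the branch mentioned here. The class name, the pump helper, and the 20 ms frame size are all assumptions made for illustration. It also assumes the audio has already been base64-decoded and is 8 kHz G.711 u-law (one byte per sample), which matches the 'g711_ulaw' output format in the session log above.

```python
import asyncio
from typing import Awaitable, Callable, Optional

FRAME_MS = 20      # assumed pacing interval
FRAME_BYTES = 160  # 20 ms of 8 kHz u-law audio at 1 byte per sample


class PacedAudioBuffer:
    """Hypothetical helper: holds model audio and releases it frame by frame."""

    def __init__(self) -> None:
        self._pending = bytearray()
        self.sent_ms = 0  # how much audio has actually been forwarded

    def push(self, chunk: bytes) -> None:
        """Queue decoded audio from a response.audio.delta event."""
        self._pending.extend(chunk)

    def pop_frame(self) -> Optional[bytes]:
        """Return the next frame to forward, or None if nothing is queued."""
        if not self._pending:
            return None
        frame = bytes(self._pending[:FRAME_BYTES])
        del self._pending[:FRAME_BYTES]
        self.sent_ms += len(frame) * FRAME_MS // FRAME_BYTES
        return frame

    def clear(self) -> int:
        """Drop unsent audio on interruption; return the ms already sent."""
        self._pending.clear()
        sent, self.sent_ms = self.sent_ms, 0
        return sent


async def pump(buffer: PacedAudioBuffer,
               send: Callable[[bytes], Awaitable[None]],
               stop: asyncio.Event) -> None:
    """Forward roughly one frame per FRAME_MS until stop is set."""
    while not stop.is_set():
        frame = buffer.pop_frame()
        if frame is not None:
            await send(frame)  # e.g. wrap in a Twilio media message
        await asyncio.sleep(FRAME_MS / 1000)
```

On an interruption you would call clear() and use the returned value as a rough estimate of how much of the response the caller actually heard. Sleeping a fixed interval drifts a little over time, so a real implementation would probably pace against a clock or Twilio's mark messages instead.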

Hope that helps!

pkamp3 commented 2 weeks ago

Just want to chime in that we're looking at interruptions/preemption on the node version at the moment: https://github.com/twilio-samples/speech-assistant-openai-realtime-api-node/issues/9 .

@jhmaddox you're using a queue and Mark message to determine when to send the next response.audio.delta to Twilio?

frmsaul commented 2 weeks ago

@robcontreras

I had success with doing this:

    # When the caller starts speaking, tell Twilio to drop any queued audio.
    if response['type'] == 'input_audio_buffer.speech_started':
        await websocket.send_json({"event": "clear", "streamSid": stream_sid})

It flushes the Twilio audio stream. Basically, it clears whatever you have already sent.
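A slightly fuller sketch of the same idea, in case it helps: besides clearing Twilio's queue, you can also tell OpenAI how much of its reply was actually heard, using the Realtime API's conversation.item.truncate event. The function name and the played_ms bookkeeping are assumptions for illustration. You would need to track the assistant item's ID and the playback position yourself.

```python
from typing import Any, Dict, Tuple


def interruption_messages(stream_sid: str,
                          assistant_item_id: str,
                          played_ms: int) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Build the two messages to send when speech_started arrives.

    Hypothetical helper: the first dict goes to the Twilio websocket,
    the second to the OpenAI websocket.
    """
    # Drops any audio Twilio has buffered but not yet played.
    twilio_clear = {"event": "clear", "streamSid": stream_sid}

    # Trims the assistant's audio item to what was played, so the model's
    # view of the conversation matches what the caller heard.
    openai_truncate = {
        "type": "conversation.item.truncate",
        "item_id": assistant_item_id,
        "content_index": 0,
        "audio_end_ms": max(played_ms, 0),
    }
    return twilio_clear, openai_truncate
```

Inside the receive loop this would look something like `clear_msg, truncate_msg = interruption_messages(stream_sid, last_item_id, played_ms)`, followed by sending each dict as JSON on its own socket. Here last_item_id and played_ms are placeholders for whatever state you keep.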

pkamp3 commented 1 week ago

I have a PR here if anyone would like to test: https://github.com/twilio-samples/speech-assistant-openai-realtime-api-python/pull/13 It should work to within a reasonable margin of error around the point of your interruption. We're looking into this internally as well.

pkamp3 commented 3 days ago

https://github.com/twilio-samples/speech-assistant-openai-realtime-api-python/pull/13 closing with this merged

twilio-samples / speech-assistant-openai-realtime-api-python

server VAD doesn't seem to pick up voice stopping #6