openai / openai-realtime-api-beta

Node.js + JavaScript reference client for the Realtime API (beta)
MIT License
666 stars 155 forks

interruption toggle #26

Open vincentwi opened 1 month ago

vincentwi commented 1 month ago

Hi, I would like to request adding an interruption bool to updateSession. The expected behavior would be to toggle (on/off) the mechanism that interrupts the assistant when the user speaks over it. This is particularly useful in scenarios where it is fine for several agents to produce speech simultaneously (e.g. split-screen sports broadcasting, poetry or singing over music, game characters exclaiming, commentary on a debate, etc.).

It should be pretty simple to implement:

1. a flag that skips this line: https://github.com/openai/openai-realtime-api-beta/blob/d7bf27b842638f01c0d07d517d0d8a1b9ce4b63b/lib/client.js#L323
2. refraining from calling cancelResponse in client code, e.g. in the realtime console: https://github.com/openai/openai-realtime-console/blob/6ea4dba795fee868c60ea9e8e7eba7469974b3e9/src/pages/ConsolePage.tsx#L238

khorwood-openai commented 1 month ago

The conversation.interrupted event is just an event -- by default it doesn't trigger any method; it's just something you can listen to. I would simply avoid using the cancelResponse method. In server_vad mode, voice detection is automatic (it happens server-side, not client-side) -- the client code only truncates audio or interrupts manually.

You can disable server_vad mode and you should be good. Let me know?
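For reference, a sketch of what this advice looks like with the reference client, assuming `turn_detection: null` disables server VAD (as in the Realtime API's `session.update`):

```javascript
const { RealtimeClient } = require('@openai/realtime-api-beta');

const client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY });

// Turn off server-side voice detection so the model never auto-interrupts;
// you then decide yourself when to request a response.
client.updateSession({ turn_detection: null });

// Informational only: nothing is cancelled unless this handler chooses
// to call client.cancelResponse(...) itself.
client.on('conversation.interrupted', () => {
  // intentionally empty -- let the assistant keep talking
});
```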

toncic commented 4 weeks ago

Hello, I have a case where I need an assistant to listen to a conversation in real time and extract some specific information. Because it's listening to a conversation, the assistant needs server_vad enabled. However, the assistant's response gets interrupted every time, so there is no useful output.

Ideally, I need an assistant that listens constantly and asynchronously (simultaneously) extracts information from the discussion as it appears.

Any idea how this could be achieved?

ryanmmmmm commented 3 days ago

We are not cancelling the response, but we continuously keep sending audio because the person is continuously speaking, and we just want the translated response as they talk. But the API stops processing the audio when it's interrupted, so we don't get the full audio translation. I am not following the statement "The conversation.interrupted event is just an event": we do see those events, but we are NOT calling cancelResponse anywhere, and yet the model does stop sending an audio response for the audio streams we append in server_vad mode.

vincentwi commented 2 days ago

Yes, overdubbing live translation is another good example of a scenario where it is fine for AI to produce speech simultaneously. On top of the ones identified in my first comment, dynamic computer control, comedy improv, NotebookLM-style podcasts, film/reels/media voiceover, Twitch commentary, etc. are more use cases.

@khorwood-openai disabling server_vad did not solve the problem, nor did omitting cancelResponse. I think @ryanmmmmm and I are dealing with the problem where we are continuously uploading ~1-5s input chunks via response.create, and the server won't process them simultaneously. Turn detection seems to be built in based on silence/pauses.

Hoping your team may have suggestions on how to get the audio processed asynchronously rather than putting clips on a queue.
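For context, the chunked-upload pattern described above looks roughly like this. This is a reconstruction under stated assumptions, not a recommended workaround; `appendInputAudio` and `createResponse` are from the reference client, and `audioChunks` (Int16Array PCM16 at 24kHz) is a placeholder for the caller's audio source:

```javascript
const { RealtimeClient } = require('@openai/realtime-api-beta');

async function streamChunks(audioChunks /* Int16Array PCM16 @ 24kHz */) {
  const client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY });
  await client.connect();
  client.updateSession({ turn_detection: null }); // manual turn-taking

  // Each ~1-5s chunk is appended and a response is requested explicitly.
  // The server treats each createResponse() as its own turn, so the chunks
  // end up processed one after another (a queue), not simultaneously.
  for (const chunk of audioChunks) {
    client.appendInputAudio(chunk);
    client.createResponse();
  }
}
```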