livekit / agents

Build real-time multimodal AI applications 🤖🎙️📹
https://docs.livekit.io/agents
Apache License 2.0

Agent speech output audio is interpreted as user speech #315

Open andrewjhogue opened 1 month ago

andrewjhogue commented 1 month ago

When using LiveKit agents, the agent sometimes hears its own TTS output (e.g., via the laptop speakers) and interprets it as speech from the user.

This creates a feedback loop in which the agent translates and responds a second time to its own speech output.

This only seems to happen when device volume is above ~25-30% and audio is being played through the device speakers.

To provide a seamless UX though, the user shouldn't have to worry about managing volume level in order to prevent this.

My current approach is:

  1. When instantiating a LiveKit room, enabling noiseSuppression and echoCancellation, e.g.:

    <LiveKitRoom
        token={createAudioCoachingCallRequest.result.room_access_token}
        serverUrl={createAudioCoachingCallRequest.result.active_server_websocket_url}
        audio={{echoCancellation: true, noiseSuppression: true}}
        connect={true}
    >
  2. Enabling allow_interruptions=True in agent.py, e.g.:

    assistant = VoiceAssistant(
        ...,
        allow_interruptions=True,
    )
  3. Muting the user's mic on the user_speech_committed and agent_started_speaking events, then unmuting on the agent_speech_committed event (i.e., after the agent finishes speaking).

Muting the user's mic is a short-term workaround -- the main limitation being that the user can't interrupt the agent once it starts speaking.
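
For reference, the mute/unmute logic from step 3 can be sketched as a small state tracker. This is only an illustration: the three event names come from the list above, but `MicGate`, `MUTE_EVENTS`, and `UNMUTE_EVENTS` are hypothetical names, and how the mute decision actually reaches the user's microphone (client-side track mute, a data message, etc.) is left out.

```python
from dataclasses import dataclass, field

# Event names are the ones mentioned in this post; the grouping is an assumption.
MUTE_EVENTS = {"user_speech_committed", "agent_started_speaking"}
UNMUTE_EVENTS = {"agent_speech_committed"}


@dataclass
class MicGate:
    """Hypothetical helper: tracks whether the user's mic should be muted."""

    muted: bool = False
    history: list = field(default_factory=list)

    def on_event(self, name: str) -> bool:
        """Update the mute state for an event and return the new state."""
        if name in MUTE_EVENTS:
            self.muted = True
        elif name in UNMUTE_EVENTS:
            self.muted = False
        # Unrelated events leave the state unchanged.
        self.history.append((name, self.muted))
        return self.muted


if __name__ == "__main__":
    gate = MicGate()
    for event in (
        "user_speech_committed",
        "agent_started_speaking",
        "agent_speech_committed",
    ):
        print(event, "-> muted:", gate.on_event(event))
```

The sketch also makes the limitation visible: between agent_started_speaking and agent_speech_committed the mic stays muted, so allow_interruptions=True has no effect during that window.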

Are there best practices for preventing this feedback loop, or is this something LiveKit is working on addressing?

keepingitneil commented 1 month ago

Haven't had this issue on Chrome + MacBook. WebRTC echo cancellation is typically pretty good. What browser/device are you testing on?

andrewjhogue commented 1 month ago

I'm running this on the latest Chrome on a MacBook (macOS Ventura 13).

It hasn't happened much on our live site yet. It seems to be sporadic and mostly happens locally, usually near the beginning of a session.