livekit / agents

Build real-time multimodal AI applications 🤖🎙️📹
https://docs.livekit.io/agents

ElevenLabs TTS websocket connection design #306

Open fjprobos opened 1 month ago

fjprobos commented 1 month ago

Hi,

I was able to make the minimal_assistant.py implementation work. Once I sorted out all the difficulties, it runs pretty well! Kudos for that 😃.

I have a question regarding the WebSocket connections used in the ElevenLabs TTS module. In my environment, I noticed that the WebSocket creation is being triggered every time the agent responds to the user. Consequently, the WebSocket is being closed every time the agent stops talking.

Question:

I believe closing and reopening the WebSocket repeatedly introduces unnecessary overhead. Wouldn't it be more efficient to maintain one, or a few, stable connections throughout the session?

Looking forward to your insights on this.

Thank you!

keepingitneil commented 1 month ago

This was a constraint of ElevenLabs: additional text can't be sent on the same websocket after an EOS message, and the EOS signal is what's used to flush.

Looking at their docs now, it seems they have since introduced a "flush" flag in the protocol, which we can look into using.
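For reference, a rough sketch of what using that flag might look like (not our plugin code; the endpoint, query params, and field names such as `xi_api_key`, `flush`, and the empty-string EOS message are from my reading of their docs, so double-check against the current documentation):

```python
import asyncio
import base64
import json

import websockets  # pip install websockets

VOICE_ID = "<voice_id>"   # placeholder
API_KEY = "<xi_api_key>"  # placeholder
URL = (
    "wss://api.elevenlabs.io/v1/text-to-speech/"
    f"{VOICE_ID}/stream-input?model_id=eleven_turbo_v2"
)


async def synthesize(segments: list[str]) -> bytes:
    audio = b""
    async with websockets.connect(URL) as ws:
        # BOS message: authenticates and configures the stream.
        await ws.send(json.dumps({"text": " ", "xi_api_key": API_KEY}))

        for segment in segments:
            await ws.send(json.dumps({"text": segment}))

        # "flush" asks the server to synthesize whatever is buffered without
        # ending the stream, so more text could still be sent afterwards.
        await ws.send(json.dumps({"text": " ", "flush": True}))

        # An empty string is the EOS signal: the server finishes and closes
        # the socket, which is why a fresh connection is needed per EOS today.
        await ws.send(json.dumps({"text": ""}))

        async for msg in ws:
            data = json.loads(msg)
            if data.get("audio"):
                audio += base64.b64decode(data["audio"])
            if data.get("isFinal"):
                break
    return audio


if __name__ == "__main__":
    audio_bytes = asyncio.run(synthesize(["Hello ", "world. "]))
    print(f"received {len(audio_bytes)} bytes of audio")
```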

That said, there is typically no additional latency introduced for the end user with this strategy, because the next websocket connection will have been established long before speech generation is needed.
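To illustrate the prewarming idea (just a sketch with the `websockets` library and a placeholder URL, not the actual agents code): while one response is being played out, the connection for the next one is already being established in the background, so the handshake cost stays off the critical path.

```python
import asyncio

import websockets  # pip install websockets

URL = "wss://api.elevenlabs.io/v1/text-to-speech/<voice_id>/stream-input"  # placeholder


class PrewarmedTTSSockets:
    """Keeps one websocket connecting/connected ahead of time."""

    def __init__(self, url: str) -> None:
        self._url = url
        self._next_ws: asyncio.Task | None = None

    async def _connect(self):
        return await websockets.connect(self._url)

    def _prewarm(self) -> None:
        # Kick off the handshake now; nobody awaits it until it's needed.
        self._next_ws = asyncio.create_task(self._connect())

    async def acquire(self):
        if self._next_ws is None:
            self._prewarm()
        ws = await self._next_ws  # usually already connected by this point
        self._prewarm()           # immediately start warming the next socket
        return ws


async def main() -> None:
    pool = PrewarmedTTSSockets(URL)
    ws = await pool.acquire()  # first call pays the handshake cost
    # ... stream one response here; ElevenLabs closes the socket after EOS ...
    await ws.close()
    ws = await pool.acquire()  # later calls should find a warm socket waiting
    await ws.close()


if __name__ == "__main__":
    asyncio.run(main())
```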

fjprobos commented 1 month ago

Thanks for the clarification! Regarding the latter point, though, I saw something different while debugging: the connection is established when the synthesis task starts (in the line I referenced in my first message), and no pre-established connection is available at that point, contrary to what you describe.
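A quick way to see this (a hypothetical timing snippet with a placeholder URL, not the agents code) is to time the handshake at the point where the synthesis task connects; with a pre-established socket this would be close to zero:

```python
import asyncio
import time

import websockets  # pip install websockets

URL = "wss://api.elevenlabs.io/v1/text-to-speech/<voice_id>/stream-input"  # placeholder


async def main() -> None:
    t0 = time.perf_counter()
    ws = await websockets.connect(URL)
    # A value well above zero here means the handshake happened lazily,
    # i.e. the connection was not established ahead of time.
    print(f"handshake took {time.perf_counter() - t0:.3f}s")
    await ws.close()


if __name__ == "__main__":
    asyncio.run(main())
```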
