jezell opened 1 month ago
Thinking more on this: perhaps the energy filter could be used to open the stream, and to close it after a bit of inactivity, so that the TTS timings can come across the wire with an offset from the initial position. Alternatively, the filter could keep track of where frames were dropped and use that to correct the timings returned from Deepgram. Either way, timings are definitely important in the transcript data.
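A rough sketch of the second idea, assuming the filter can report, frame by frame, whether it forwarded or dropped the audio. `DroppedAudioTracker`, `on_frame` and `to_original` are hypothetical names, not part of livekit/agents or the Deepgram SDK. The tracker records how much audio had been sent when each run of dropped frames occurred, then shifts any timestamp measured on the sent stream forward by the total dropped duration before it. The first idea (restarting the stream with a known start offset) amounts to the same mapping applied per stream segment.

```python
from bisect import bisect_right


class DroppedAudioTracker:
    """Hypothetical helper: remembers where an energy filter dropped audio so that
    timestamps relative to the *sent* audio can be mapped back to the original stream."""

    def __init__(self) -> None:
        self._sent_at: list[float] = []      # sent-stream time at which each gap occurred
        self._cum_dropped: list[float] = []  # total dropped seconds up to and including each gap
        self._sent_time = 0.0
        self._dropped_total = 0.0

    def on_frame(self, duration: float, sent: bool) -> None:
        """Call once per audio frame (duration in seconds) with the filter's decision."""
        if sent:
            self._sent_time += duration
            return
        self._dropped_total += duration
        if self._sent_at and self._sent_at[-1] == self._sent_time:
            # Consecutive dropped frames belong to the same gap: extend it.
            self._cum_dropped[-1] = self._dropped_total
        else:
            self._sent_at.append(self._sent_time)
            self._cum_dropped.append(self._dropped_total)

    def to_original(self, sent_ts: float) -> float:
        """Map a timestamp measured on the sent audio (e.g. a word start) to original time.
        A timestamp exactly on a gap boundary maps to the end of that gap."""
        i = bisect_right(self._sent_at, sent_ts)
        return sent_ts + (self._cum_dropped[i - 1] if i else 0.0)


tracker = DroppedAudioTracker()
tracker.on_frame(1.0, sent=True)   # 1.0 s of speech forwarded
tracker.on_frame(0.5, sent=False)  # 0.5 s of silence dropped by the filter
tracker.on_frame(1.0, sent=True)   # 1.0 s of speech forwarded
print(tracker.to_original(1.2))    # a word reported at 1.2 s really started about 1.7 s in
```

Using bisect keeps each lookup logarithmic in the number of gaps, which matters little for short sessions but avoids rescanning the whole history on every transcript event.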
Makes sense to me. I've made an internal ticket, but won't start working on it until mid-week. Happy to accept a PR for this if you get to it first.
Recently, an optimization was added to avoid sending blank audio frames to Deepgram. This is nice if you don't need timings...
https://github.com/livekit/agents/pull/738/files
However, if you need timings, this gets in the way of #779: because the energy filter drops frames, Deepgram won't return correct timings.
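To illustrate why the timings drift, here is a simplified RMS energy gate. It is an assumption for illustration only, not the filter from the linked PR: its threshold, frame size and RMS measure are made up. Frames below the threshold are never sent, so the receiving side's clock only advances for forwarded audio.

```python
import math
from typing import Iterable, Iterator

FRAME_MS = 10  # assumed frame duration in milliseconds


def rms(samples: list[float]) -> float:
    """Root-mean-square energy of one frame of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def energy_gate(frames: Iterable[list[float]], threshold: float) -> Iterator[tuple[int, list[float]]]:
    """Yield (original_index, frame) only for frames whose RMS exceeds the threshold."""
    for i, frame in enumerate(frames):
        if rms(frame) > threshold:
            yield i, frame


loud = [0.5, -0.5] * 80   # a non-silent frame
silent = [0.0] * 160      # a blank frame
frames = [loud, loud, loud, silent, silent, loud, loud, loud]

for sent_idx, (orig_idx, _) in enumerate(energy_gate(frames, threshold=0.01)):
    # The receiver can only measure time on the sent stream.
    print(f"sent t={sent_idx * FRAME_MS} ms  original t={orig_idx * FRAME_MS} ms")
```

From the fourth forwarded frame onward, the sent clock lags the original one by the 20 ms of dropped silence, so any word timing computed on the sent audio is early by that amount unless it is corrected.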