livekit / agents

Build real-time multimodal AI applications 🤖🎙️📹
https://docs.livekit.io/agents
Apache License 2.0

STT Timing Information -> Propose emitting END_OF_SPEECH before FINAL_TRANSCRIPT #326

Open tortis opened 1 month ago

tortis commented 1 month ago

I'm having trouble capturing timing information with VAD + STT.

Given:

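from livekit.agents.stt import StreamAdapter
from livekit.plugins import openai, silero

# Wrap the non-streaming OpenAI STT with a VAD so it can be used as a stream.
openai_stt = openai.STT()
vad = silero.VAD()
vad_stream = vad.stream()
stt = StreamAdapter(openai_stt, vad_stream)
stt_stream = stt.stream()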

I looked into StreamAdapter and found that it re-emits the VAD start-of-speech and end-of-speech events. I planned to use those to capture timing, but the END_OF_SPEECH event is delayed until after FINAL_TRANSCRIPT, so the measured timing also includes inference and API call overhead.

It looks like the END_OF_SPEECH event includes the first alternative just for convenience. I propose propagating the VAD events as-is in the adapter and directing users to the transcript events for transcription results.

diff --git a/livekit-agents/livekit/agents/stt/stream_adapter.py b/livekit-agents/livekit/agents/stt/stream_adapter.py
index 7050178..9b2d918 100644
--- a/livekit-agents/livekit/agents/stt/stream_adapter.py
+++ b/livekit-agents/livekit/agents/stt/stream_adapter.py
@@ -76,6 +76,9 @@ class StreamAdapterWrapper(SpeechStream):
                     start_event = SpeechEvent(SpeechEventType.START_OF_SPEECH)
                     self._event_queue.put_nowait(start_event)
                 elif event.type == VADEventType.END_OF_SPEECH:
+                    end_event = SpeechEvent(type=SpeechEventType.END_OF_SPEECH)
+                    self._event_queue.put_nowait(end_event)
+
                     merged_frames = merge_frames(event.frames)
                     event = await self._stt.recognize(
                         buffer=merged_frames, *self._args, **self._kwargs
@@ -87,12 +90,6 @@ class StreamAdapterWrapper(SpeechStream):
                         alternatives=[event.alternatives[0]],
                     )
                     self._event_queue.put_nowait(final_event)
-
-                    end_event = SpeechEvent(
-                        type=SpeechEventType.END_OF_SPEECH,
-                        alternatives=[event.alternatives[0]],
-                    )
-                    self._event_queue.put_nowait(end_event)
         except Exception:
             logging.exception("stt stream adapter failed")
         finally:
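
To make the timing difference concrete, here is a minimal, self-contained sketch that does not use livekit at all. It simulates the END_OF_SPEECH branch of the adapter in both orderings and measures how long after the VAD end the END_OF_SPEECH event is queued. Every name in it (`EventType`, `fake_recognize`, `handle_vad_end`, `end_of_speech_lag`) is hypothetical, and the `asyncio.sleep` call is only an assumed stand-in for the real STT request.

```python
import asyncio
import time
from enum import Enum, auto


class EventType(Enum):
    END_OF_SPEECH = auto()
    FINAL_TRANSCRIPT = auto()


async def fake_recognize(delay: float) -> str:
    # Hypothetical stand-in for the STT call; the sleep models
    # inference and API overhead.
    await asyncio.sleep(delay)
    return "hello world"


async def handle_vad_end(queue: asyncio.Queue, delay: float, end_first: bool) -> None:
    # Simulates the adapter's END_OF_SPEECH branch in either ordering.
    if end_first:
        # Proposed: emit END_OF_SPEECH before calling the recognizer.
        queue.put_nowait((EventType.END_OF_SPEECH, time.monotonic(), None))
    text = await fake_recognize(delay)
    queue.put_nowait((EventType.FINAL_TRANSCRIPT, time.monotonic(), text))
    if not end_first:
        # Current: emit END_OF_SPEECH only after the transcript is ready.
        queue.put_nowait((EventType.END_OF_SPEECH, time.monotonic(), text))


async def end_of_speech_lag(end_first: bool, delay: float = 0.2) -> float:
    # Seconds between the (simulated) VAD end and END_OF_SPEECH being queued.
    queue: asyncio.Queue = asyncio.Queue()
    vad_end = time.monotonic()
    await handle_vad_end(queue, delay, end_first)
    while not queue.empty():
        event_type, stamp, _ = queue.get_nowait()
        if event_type is EventType.END_OF_SPEECH:
            return stamp - vad_end
    raise RuntimeError("END_OF_SPEECH was never emitted")


if __name__ == "__main__":
    current = asyncio.run(end_of_speech_lag(end_first=False))
    proposed = asyncio.run(end_of_speech_lag(end_first=True))
    print(f"current ordering:  END_OF_SPEECH lag {current:.3f}s")
    print(f"proposed ordering: END_OF_SPEECH lag {proposed:.3f}s")
```

In this simulation the current ordering reports a lag of at least the recognizer delay, while the proposed ordering reports a lag close to zero, which is the behavior the patch above aims for.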

vanics commented 2 weeks ago

+1