I'm having trouble capturing timing information with VAD + STT.
given:
I looked into the StreamAdapter and found that it re-emits the VAD start/end-of-speech events. I was planning to use those to capture timing, but the END_OF_SPEECH event is delayed until after FINAL_TRANSCRIPT, so the measured duration now includes inference and API-call overhead.
It looks like the END_OF_SPEECH event includes the first alternative purely for convenience. I would propose propagating the VAD events as-is in the adapter and directing users to the transcript events for transcription results.
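To illustrate what I mean, here is a minimal sketch of the consumer side under that proposal. The event names and the `SpeechTimer` class are hypothetical (the real adapter's identifiers may differ); the point is that timing comes solely from the VAD events, while FINAL_TRANSCRIPT is consumed separately and its arrival time never affects the measurement:

```python
# Hypothetical event names modeled on the ones described above;
# the actual adapter's identifiers may differ.
START_OF_SPEECH = "start_of_speech"
END_OF_SPEECH = "end_of_speech"
FINAL_TRANSCRIPT = "final_transcript"


class SpeechTimer:
    """Capture speech-segment duration from VAD events alone,
    independent of when the transcript arrives."""

    def __init__(self):
        self._start = None
        self.durations = []

    def on_event(self, name, timestamp):
        if name == START_OF_SPEECH:
            self._start = timestamp
        elif name == END_OF_SPEECH and self._start is not None:
            # If END_OF_SPEECH were instead deferred until after
            # FINAL_TRANSCRIPT, this duration would also include
            # inference and API-call latency.
            self.durations.append(timestamp - self._start)
            self._start = None
        # FINAL_TRANSCRIPT is ignored here on purpose: transcription
        # results come from transcript events, timing from VAD events.


timer = SpeechTimer()
timer.on_event(START_OF_SPEECH, 10.0)
timer.on_event(END_OF_SPEECH, 12.5)     # VAD end: pure speech duration
timer.on_event(FINAL_TRANSCRIPT, 13.1)  # arrives later; no effect on timing
print(timer.durations)  # [2.5]
```

With the events propagated as-is, the 0.6 s between END_OF_SPEECH and FINAL_TRANSCRIPT in this example stays out of the measured speech duration.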