transports(base_output): use audio frames for bot speaking detection

pipecat-ai / pipecat

Open Source framework for voice and multimodal conversational AI

BSD 2-Clause "Simplified" License

3.47k stars, 341 forks

@aconchillo this is an interesting change! I'm in favor of making it. Essentially, it would offer two separate events:

The existing bot-tts-stopped event would indicate that the TTS service has spoken all of the text provided to us. This coincides with the TTSStoppedFrame.

The bot-stopped-speaking event would indicate that the speaking turn is complete (i.e. TTS has finished and the VAD duration has elapsed).

bot-stopped-speaking is a nice parallel to the equivalent user event, and having both events gives developers more flexibility in what they can build.
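To make the distinction between the two events concrete, here is a minimal sketch of how they could be tracked independently. It is not pipecat's implementation: `BotSpeakingTracker`, its method names, the `bot-started-speaking` event, and the 0.35 s silence timeout are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BotSpeakingTracker:
    """Hypothetical helper for illustration; not part of pipecat's API."""

    silence_timeout: float = 0.35  # assumed value, in seconds
    speaking: bool = False
    last_audio_at: Optional[float] = None
    events: List[str] = field(default_factory=list)

    def on_audio_frame(self, now: float) -> None:
        # Any outgoing audio frame counts as the bot speaking.
        if not self.speaking:
            self.speaking = True
            self.events.append("bot-started-speaking")
        self.last_audio_at = now

    def on_tts_stopped(self) -> None:
        # Mirrors TTSStoppedFrame: the TTS service is done with the text.
        # This alone does not end the speaking turn.
        self.events.append("bot-tts-stopped")

    def tick(self, now: float) -> None:
        # Called periodically; ends the turn once audio has been quiet
        # for at least silence_timeout seconds.
        if not self.speaking or self.last_audio_at is None:
            return
        if now - self.last_audio_at >= self.silence_timeout:
            self.speaking = False
            self.events.append("bot-stopped-speaking")


tracker = BotSpeakingTracker(silence_timeout=0.35)
tracker.on_audio_frame(0.00)
tracker.on_audio_frame(0.02)
tracker.on_tts_stopped()
tracker.on_audio_frame(0.04)  # buffered audio is still playing out
tracker.tick(0.20)            # not quiet for long enough yet
tracker.tick(0.40)            # 0.36 s since the last audio frame
print(tracker.events)
```

In this run the TTS event is recorded while audio is still being written, and the speaking turn only ends after the silence window has passed, which is the separation the two events are meant to give developers.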

I'm not as familiar with this area of the code, so I'll defer to @kwindla on that.

That's correct. One detail: it's possible to get bot-stopped-speaking before bot-tts-stopped. This is because of how some TTS services work and the timeouts we use to detect when a TTS service has stopped sending data.

That is, the timeout for bot stopped speaking is different from the timeout we use to send TTS stopped downstream. Maybe they should be the same.
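A rough way to see the ordering issue is to compute when each stop event would fire from its own timeout. This is only a sketch under simplifying assumptions: `stop_event_times` is a hypothetical function, the timeout values are made up rather than taken from pipecat, and the last TTS data and the last audio frame are treated as arriving at the same instant.

```python
from typing import List, Tuple


def stop_event_times(
    audio_times: List[float],
    tts_stop_timeout: float,
    speaking_timeout: float,
) -> List[Tuple[float, str]]:
    """Return (time, event) pairs in the order the events would fire.

    Hypothetical model: each event fires once its own timeout has elapsed
    after the last audio chunk.
    """
    last_audio = max(audio_times)
    events = [
        (last_audio + tts_stop_timeout, "bot-tts-stopped"),
        (last_audio + speaking_timeout, "bot-stopped-speaking"),
    ]
    # Sort by time only; the sort is stable, so ties keep the order above.
    return sorted(events, key=lambda item: item[0])


# Assumed values: a long TTS idle timeout and a short speaking timeout.
different = stop_event_times([0.0, 0.5, 1.0], tts_stop_timeout=2.0, speaking_timeout=0.35)
print([name for _, name in different])

# With a shared timeout both events fire at the same time.
same = stop_event_times([0.0, 0.5, 1.0], tts_stop_timeout=0.35, speaking_timeout=0.35)
print([name for _, name in same])
```

With the shorter speaking timeout, bot-stopped-speaking lands first, which is the inversion described above. Using one shared timeout makes the two events coincide in this model.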

transports(base_output): use audio frames for bot speaking detection #667