I am using TTS in Node.js like this:

// Use SSML synthesis when the content is an SSML document, plain text otherwise.
const speakAsync = content.startsWith('<speak') ?
  synthesizer.speakSsmlAsync.bind(synthesizer) :
  synthesizer.speakTextAsync.bind(synthesizer);
speakAsync(
  content,
  async(result) => {
    switch (result.reason) {
      case ResultReason.SynthesizingAudioCompleted:
        /* save result.audioData to a file here */
        synthesizer.close();
        break;
      //..handle other reasons
      default:
        logger.info({result}, 'synthAudio: (Microsoft) unexpected result');
        break;
    }
  });
With the code above, I get about 800 milliseconds of silence at the end of the returned audio stream. This is a problem because the audio is part of a larger conversational AI system: I play a prompt to the user and then collect their response. Since the prompt includes all of this trailing silence, the user quite often starts talking while the silence is still playing, but I have not started recognition yet because I only do that after the audio has fully played.
I realize that I can bring in something like ffmpeg to post-process the audio and trim the silence. However, the amount of silence returned seems excessive (and quite a bit more than other TTS engines), so is there any way to tune or configure the TTS engine to avoid returning this silence?
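For reference, the post-processing workaround I have in mind could be sketched roughly as below. This is only a minimal sketch under stated assumptions: the audio is headerless 16-bit little-endian mono PCM (a WAV/RIFF header would need to be skipped and its size fields rewritten), and `trimTrailingSilence`, `threshold`, and `keepMs` are hypothetical names and values I made up for illustration, not part of any SDK.

```javascript
// Hypothetical helper: trims trailing near-silence from raw 16-bit LE mono PCM.
// Assumption: `pcm` is a Node.js Buffer containing headerless PCM samples.
function trimTrailingSilence(pcm, { sampleRate = 16000, threshold = 200, keepMs = 50 } = {}) {
  const bytesPerSample = 2;
  const sampleCount = Math.floor(pcm.length / bytesPerSample);

  // Walk backwards until a sample is louder than the amplitude threshold.
  let last = sampleCount - 1;
  while (last >= 0 && Math.abs(pcm.readInt16LE(last * bytesPerSample)) <= threshold) {
    last--;
  }

  // Keep a short tail after the last loud sample so the audio does not end abruptly.
  const keepSamples = Math.round((sampleRate * keepMs) / 1000);
  const end = Math.min(sampleCount, last + 1 + keepSamples);

  // subarray returns a view on the same memory; copy it if the source buffer will be reused.
  return pcm.subarray(0, end * bytesPerSample);
}
```

In the `SynthesizingAudioCompleted` case this might be called as `trimTrailingSilence(Buffer.from(result.audioData))` before saving the file, assuming the synthesizer is configured for a raw PCM output format at the matching sample rate. I would still prefer a way to avoid the extra step.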
Pasting a screenshot of the generated audio here to make it clearer.
Above you can see there is quite a bit of silence at the end of the audio stream that Azure returns.
Meanwhile, if I generate the same audio using Google, I get quite a bit less silence at the end: