microsoft / cognitive-services-speech-sdk-js

Microsoft Azure Cognitive Services Speech SDK for JavaScript

Excessive generated silence at end of TTS #691

Closed davehorton closed 1 year ago

davehorton commented 1 year ago

I am using TTS in Node.js like this:

      // Use the SSML entry point when the content is already an SSML document.
      const speakAsync = content.startsWith('<speak') ?
        synthesizer.speakSsmlAsync.bind(synthesizer) :
        synthesizer.speakTextAsync.bind(synthesizer);
      speakAsync(
        content,
        async(result) => {
          switch (result.reason) {
            case ResultReason.SynthesizingAudioCompleted:
              /* save result.audioData to a file here */
              synthesizer.close();
              break;
            //..handle other reasons
            default:
              logger.info({result}, 'synthAudio: (Microsoft) unexpected result');
              break;
          }
        });

and I am noticing about 800 milliseconds of silence at the end of the audio stream that is returned. This creates a problem for me, as this is part of a larger conversational AI system where I play a prompt to the user and then collect their response. Because the "prompt" includes all this silence, the user quite often starts talking during the playback of silence, but I have not started recognition yet because I only do that after the audio has fully played.

I realize that I can bring in something like ffmpeg to postprocess the audio and trim the silence, but given that the amount of silence returned seems excessive (and quite a bit more than other TTS engines) I was wondering if there was any way to tune or configure things to avoid having the TTS engine return this silence.

Pasting a screenshot of the generated audio here to make it clearer:

image

Above you can see there is quite a bit of silence at the end of the audio stream that Azure returns.

Meanwhile, if I generate the same audio using Google, I get quite a bit less silence at the end:

image
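
For reference, the post-processing route mentioned above does not strictly need ffmpeg. The following is a minimal sketch, not part of the original post, that trims trailing silence in plain Node.js. It assumes the audio is headerless 16-bit little-endian mono PCM (for example, one of the SDK's raw PCM output formats); a RIFF/WAV header, if present, would have to be skipped and its length fields rewritten. The function name `trimTrailingSilence` and the `threshold` and `keepMs` defaults are hypothetical and would need tuning against real audio.

```javascript
// Sketch: trim trailing silence from raw 16-bit little-endian mono PCM.
// Assumption: `pcm` is a Node.js Buffer of headerless PCM samples.
function trimTrailingSilence(pcm, { sampleRate = 16000, threshold = 200, keepMs = 50 } = {}) {
  const samples = Math.floor(pcm.length / 2);
  let last = samples - 1;
  // Walk backwards until a sample exceeds the amplitude threshold.
  while (last >= 0 && Math.abs(pcm.readInt16LE(last * 2)) <= threshold) {
    last--;
  }
  // Keep a short tail so the end of the last word is not clipped.
  const keepSamples = Math.round((keepMs * sampleRate) / 1000);
  const end = Math.min(samples, last + 1 + keepSamples);
  // Report how much audio was removed, in milliseconds.
  const trimmedMs = ((samples - end) * 1000) / sampleRate;
  return { audio: pcm.subarray(0, end * 2), trimmedMs };
}
```

In the callback above, this could be applied as `trimTrailingSilence(Buffer.from(result.audioData))` before saving the file, provided the synthesizer was configured for a raw PCM output format at the matching sample rate.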

CrystalWLH commented 1 year ago

Hi @davehorton, we support several ways to control silence length through the SSML silence tag. Details can be found in this doc
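
As an illustration of the suggestion above (not taken from the thread), here is a hedged sketch that wraps plain text in SSML using the `mstts:silence` element with the `Tailing-exact` type, which, as I understand the Azure SSML documentation, sets an absolute silence length at the end of the text. The helper names, the voice name `en-US-JennyNeural`, and the default of `0ms` are assumptions for the example; check the element's placement and allowed values against the current documentation.

```javascript
// Sketch: build SSML that requests an exact trailing silence.
// Assumptions: helper names, voice, and defaults are illustrative only.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

function buildSsmlWithTrailingSilence(text, { voice = 'en-US-JennyNeural', lang = 'en-US', trailingMs = 0 } = {}) {
  return [
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"`,
    `       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="${lang}">`,
    `  <voice name="${voice}">`,
    // "Tailing-exact" is the spelling used by the service for trailing silence.
    `    <mstts:silence type="Tailing-exact" value="${trailingMs}ms"/>`,
    `    ${escapeXml(text)}`,
    `  </voice>`,
    `</speak>`,
  ].join('\n');
}
```

Because the result starts with `<speak`, the dispatch in the original snippet would route it to `speakSsmlAsync`, for example `speakAsync(buildSsmlWithTrailingSilence('Please say your account number.'), ...)`.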

glharper commented 1 year ago

Closing as answered