microsoft / cognitive-services-speech-sdk-js

Microsoft Azure Cognitive Services Speech SDK for JavaScript

SpeechSynthesisResult returned from speakSsmlAsync reports inaccurate audio duration for synthesized audio #644

Closed GJStevenson closed 1 year ago

GJStevenson commented 1 year ago

Describe the bug

When synthesizing audio from text, the reported audio duration does not match the actual duration of the audio that is returned. The reported value seems to be ~0.08 seconds off from the actual generated audio. This seems to happen regardless of which voice model is used or which adjustments are made (pitch, pace, style, etc.).

To Reproduce

  1. Synthesize audio for any amount of text using a SpeechSynthesizer with the output format set to Audio16Khz32KBitRateMonoMp3.
  2. In the speakSsmlAsync callback, log the reported audioDuration.
  3. Compare that duration to the actual duration of the audioData.
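
One way to do the comparison in step 3 without an audio editor is to walk the MP3 frame headers in the returned bytes and add up the frame lengths. The following is a minimal sketch, not part of the Speech SDK: `estimateMp3DurationSeconds` is a hypothetical helper, it only understands MPEG-2 Layer III frames (the kind used for 16 kHz audio), and it assumes the buffer contains no ID3 tag or other metadata that could look like a frame header.

```javascript
// Lookup tables for MPEG-2 Layer III only (assumption: matches the 16 kHz MP3 format above).
const MPEG2_L3_BITRATES_KBPS = [0, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160];
const MPEG2_SAMPLE_RATES_HZ = [22050, 24000, 16000];
const SAMPLES_PER_FRAME = 576;

// Hypothetical helper: counts frames in an MP3 byte array and converts them to seconds.
function estimateMp3DurationSeconds(bytes) {
  let offset = 0;
  let totalSamples = 0;
  let streamSampleRate = 0;
  while (offset + 4 <= bytes.length) {
    const b1 = bytes[offset + 1];
    const b2 = bytes[offset + 2];
    // Sync bits, version = MPEG-2 (binary 10), layer = III (binary 01); the CRC bit is ignored.
    const isMpeg2Layer3 = bytes[offset] === 0xff && (b1 & 0xfe) === 0xf2;
    const bitrateKbps = MPEG2_L3_BITRATES_KBPS[b2 >> 4];
    const sampleRate = MPEG2_SAMPLE_RATES_HZ[(b2 >> 2) & 0x03];
    if (!isMpeg2Layer3 || !bitrateKbps || !sampleRate) {
      offset += 1; // not a usable frame header: try again one byte later
      continue;
    }
    const padding = (b2 >> 1) & 0x01;
    const frameBytes = Math.floor((72 * bitrateKbps * 1000) / sampleRate) + padding;
    totalSamples += SAMPLES_PER_FRAME;
    streamSampleRate = sampleRate;
    offset += frameBytes; // jump straight to the next frame header
  }
  return streamSampleRate ? totalSamples / streamSampleRate : 0;
}

// Demo on a synthetic buffer: 190 empty 144-byte frames (16 kHz, 32 kbps, mono header).
const demo = new Uint8Array(190 * 144);
for (let i = 0; i < 190; i++) demo.set([0xff, 0xf3, 0x48, 0xc4], i * 144);
console.log(estimateMp3DurationSeconds(demo));
```

Inside the speakSsmlAsync callback, the result of `estimateMp3DurationSeconds(new Uint8Array(result.audioData))` could then be compared with `result.audioDuration / 10000000`. The figure may still differ slightly from what a tool such as Audacity shows, since decoders handle leading and trailing frames differently.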

Expected

The two values are the same.

Actual

They are off by a little bit.

One example of SSML that has this issue is the following:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
       <prosody rate="0%" pitch="0%">
          <mstts:express-as style="Default">
             This is a test to demonstrate the reported audio duration differing slightly from the actual generated audio.
          </mstts:express-as>
       </prosody>
    </voice>
 </speak>

The reported audio duration was 67375000 (6.7375 seconds).

The actual audio duration after importing the file into Audacity was 6.840 seconds.
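
For reference, the arithmetic behind these two numbers can be sketched as follows. The conversion factor of 10,000,000 units per second is taken from the conversion above (67375000 corresponds to 6.7375 seconds, i.e. 100-nanosecond ticks); `ticksToSeconds` is just an illustrative helper name.

```javascript
// audioDuration appears to be in 100-nanosecond ticks: 10,000,000 ticks = 1 second.
const TICKS_PER_SECOND = 10000000;

// Illustrative helper (not an SDK function): converts a tick count to seconds.
function ticksToSeconds(ticks) {
  return ticks / TICKS_PER_SECOND;
}

const reportedSeconds = ticksToSeconds(67375000); // value logged from result.audioDuration
const measuredSeconds = 6.84;                     // value read from Audacity
const differenceSeconds = measuredSeconds - reportedSeconds;

console.log('Reported:', reportedSeconds, 's');
console.log('Difference:', differenceSeconds.toFixed(4), 's');
```

For this particular sample the gap works out to roughly 0.10 seconds.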

speech_synthesis_duration_bug.mp3.zip

And here is a snippet of the speakSsmlAsync callback:

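synthesizer.speakSsmlAsync(ssml, result => {
    // Log the duration reported by the SDK for the synthesized audio
    console.debug('Duration: ', result.audioDuration);
    // Write the returned MP3 bytes to disk so the real duration can be measured
    const filePath = path.join('/Users/g.stevenson/Desktop/foo.mp3');
    void fs.promises.writeFile(filePath, Buffer.from(result.audioData));
    ...
});

yulin-li commented 1 year ago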

Hi, thanks for reporting this, and sorry for the late reply. The audioDuration reported in the result is the duration of the synthesized raw wave audio, before it is encoded to MP3. LAME adds extra silence when encoding it to MP3 (see this).
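
As a rough illustration of why the encoded file comes out longer: MP3 audio is stored in fixed-length frames, and at 16 kHz (MPEG-2 Layer III) each frame holds 576 samples, so the file duration is always a whole number of frames. The sketch below only does the frame arithmetic for the numbers in this issue. It does not model how much delay or padding LAME actually inserts, and the variable names are illustrative.

```javascript
const SAMPLE_RATE_HZ = 16000;      // from the Audio16Khz32KBitRateMonoMp3 output format
const SAMPLES_PER_FRAME = 576;     // MPEG-2 Layer III frame length (used for 16 kHz audio)
const TICKS_PER_SECOND = 10000000; // audioDuration ticks are 100 ns each

// Number of PCM samples in the raw synthesized audio, from the reported tick count.
const rawSamples = (67375000 * SAMPLE_RATE_HZ) / TICKS_PER_SECOND;

// Fewest whole frames that can hold those samples, before any encoder delay or padding.
const minFrames = Math.ceil(rawSamples / SAMPLES_PER_FRAME);

// Frame count implied by the 6.840 s measured in Audacity.
const measuredFrames = Math.round((6.84 * SAMPLE_RATE_HZ) / SAMPLES_PER_FRAME);

function framesToSeconds(frames) {
  return (frames * SAMPLES_PER_FRAME) / SAMPLE_RATE_HZ;
}

console.log('raw samples:', rawSamples);
console.log('minimum frames:', minFrames, '=', framesToSeconds(minFrames), 's');
console.log('measured frames:', measuredFrames, '=', framesToSeconds(measuredFrames), 's');
```

With these numbers, rounding the raw audio up to whole frames gives 188 frames (6.768 s), while the measured 6.840 s corresponds to exactly 190 frames. The two additional frames (0.072 s) would be consistent with the extra silence described above.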

GJStevenson commented 1 year ago

Ah, that would explain it. Thank you!