Closed GJStevenson closed 1 year ago
Hi, thanks for reporting this, and sorry for the late reply. The audioDuration
reported in the result is the duration of the synthesized raw waveform. However,
LAME adds extra silence when encoding it to MP3 (see this).
Ah that would explain it. Thank you!
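For anyone who wants to sanity-check the padding explanation, here is a rough sketch using the two durations quoted in the report (6.7375 s reported, 6.840 s measured in Audacity). It assumes the 16 kHz MP3 output is MPEG-2 Layer III with 576 samples per frame. The thread does not state how much silence LAME adds, so this only shows how large the gap is in samples and frames.

```javascript
// Assumption: 16 kHz MP3 output is MPEG-2 Layer III, which uses 576 samples per frame.
const SAMPLE_RATE_HZ = 16000;
const SAMPLES_PER_FRAME = 576;

// Convert a duration in seconds to a whole number of samples.
function toSamples(seconds) {
  return Math.round(seconds * SAMPLE_RATE_HZ);
}

const rawSamples = toSamples(6.7375); // duration reported by audioDuration -> 107800
const mp3Samples = toSamples(6.840);  // duration measured in Audacity -> 109440

// The measured file length is a whole number of frames (190); the raw audio is not (~187.15).
const mp3Frames = mp3Samples / SAMPLES_PER_FRAME;
const rawFrames = rawSamples / SAMPLES_PER_FRAME;

// Extra audio present in the MP3 relative to the raw waveform.
const paddingSamples = mp3Samples - rawSamples;               // 1640
const paddingMs = (paddingSamples * 1000) / SAMPLE_RATE_HZ;   // 102.5

console.log({ rawSamples, mp3Samples, rawFrames, mp3Frames, paddingSamples, paddingMs });
```

The measured length coming out as a whole number of frames is consistent with the encoder padding the stream, although other decoders may trim the padding differently and report a slightly different length.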
Describe the bug
When synthesizing audio from text, the reported audio duration does not match the actual duration of the audio that is returned. The reported duration seems to be ~0.08 seconds off from the actual generated audio. This seems to happen regardless of the voice model used or the adjustments made (pitch, pace, style, etc.).
To Reproduce
1. Create a SpeechSynthesizer set to the output format Audio16Khz32KBitRateMonoMp3.
2. Call .speakSsmlAsync.
3. In the callback, log the reported audioDuration.
4. Compare it with the actual duration of the returned .audioData.
Expected
These values are the same
Actual
They are off by a little bit.
One example of SSML that has this issue is the following:
The reported audio duration was 67375000 (6.7375 seconds).
The actual audio duration after importing the file into Audacity was 6.840 seconds.
speech_synthesis_duration_bug.mp3.zip
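The conversion behind those numbers can be sketched as follows. It assumes audioDuration is expressed in 100-nanosecond ticks, which matches the 67375000 to 6.7375 seconds conversion given here. The helper name `ticksToSeconds` is made up for illustration and is not part of the SDK.

```javascript
// Assumption: audioDuration is in 100-nanosecond ticks, i.e. 10,000,000 ticks per second.
const TICKS_PER_SECOND = 10000000;

// Hypothetical helper (not an SDK function): convert ticks to seconds.
function ticksToSeconds(ticks) {
  return ticks / TICKS_PER_SECOND;
}

const reportedSeconds = ticksToSeconds(67375000); // value logged from the callback
const measuredSeconds = 6.840;                    // value read from Audacity

// Gap between the decoded MP3 length and the reported duration for this file.
const differenceSeconds = measuredSeconds - reportedSeconds;

console.log(reportedSeconds);              // 6.7375
console.log(differenceSeconds.toFixed(4)); // 0.1025
```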
And here is a snippet of the speakSsmlAsync callback: