readbeyond / aeneas

aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
http://www.readbeyond.it/aeneas/
GNU Affero General Public License v3.0

Sample rate mismatch leads to incorrect timing #260

Open Yorwba opened 3 years ago

Yorwba commented 3 years ago

To reproduce, run aeneas (latest devel) in four configurations: cew enabled or disabled, combined with an ffmpeg sample rate of either 16000 or 22050 Hz:

for conf in cew={True,False}'|'ffmpeg_sample_rate={16000,22050}; do
    python -m aeneas.tools.execute_task -v aeneas/tools/res/audio.mp3 aeneas/tools/res/plain.txt 'task_language=eng|is_text_type=plain|os_task_file_format=srt' -r="$conf" sonnet-"$conf".srt
done

Then look at the last 4 lines of each:

tail -n4 *.srt
==> sonnet-cew=False|ffmpeg_sample_rate=16000.srt <==
15
00:00:53,200 --> 00:00:53,240
To eat the world's due, by the grave and thee.

==> sonnet-cew=False|ffmpeg_sample_rate=22050.srt <==
15
00:00:48,000 --> 00:00:53,240
To eat the world's due, by the grave and thee.

==> sonnet-cew=True|ffmpeg_sample_rate=16000.srt <==
15
00:00:48,080 --> 00:00:53,240
To eat the world's due, by the grave and thee.

==> sonnet-cew=True|ffmpeg_sample_rate=22050.srt <==
15
00:00:48,000 --> 00:00:53,240
To eat the world's due, by the grave and thee.

Note that the last segment starts at roughly 48 seconds in every configuration except cew=False|ffmpeg_sample_rate=16000, where it starts at 53.2 seconds and lasts only 40 ms.
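To make the comparison concrete, here is a small sketch that turns the SRT cue lines shown above into millisecond durations. The helper names `srt_time_to_ms` and `cue_duration_ms` are made up for this illustration and are not part of aeneas.

```python
import re

# Matches an SRT timestamp such as "00:00:53,200".
_SRT_TIME = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def srt_time_to_ms(stamp: str) -> int:
    """Convert an 'HH:MM:SS,mmm' timestamp to integer milliseconds."""
    match = _SRT_TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def cue_duration_ms(cue_line: str) -> int:
    """Return the duration of a 'start --> end' cue line in milliseconds."""
    start, end = cue_line.split("-->")
    return srt_time_to_ms(end) - srt_time_to_ms(start)


# Last cue of each run, copied from the tail output in this report.
last_cues = {
    "cew=False|ffmpeg_sample_rate=16000": "00:00:53,200 --> 00:00:53,240",
    "cew=False|ffmpeg_sample_rate=22050": "00:00:48,000 --> 00:00:53,240",
    "cew=True|ffmpeg_sample_rate=16000": "00:00:48,080 --> 00:00:53,240",
    "cew=True|ffmpeg_sample_rate=22050": "00:00:48,000 --> 00:00:53,240",
}
durations = {conf: cue_duration_ms(cue) for conf, cue in last_cues.items()}
```

Only the first configuration yields a last cue far shorter than the roughly five seconds seen in the other three.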

Here's a snippet from the verbose output of a run with that configuration, highlighting important lines with ----->:

[DEBU] Synthesizer: Synthesizing text...
[DEBU] ESPEAKTTSWrapper: Calling TTS engine via C extension or subprocess
[DEBU] ESPEAKTTSWrapper: C extension 'cew' disabled
[DEBU] ESPEAKTTSWrapper: Running the pure Python code
[DEBU] ESPEAKTTSWrapper: Synthesizing multiple via subprocess...
[DEBU] ESPEAKTTSWrapper: Calling TTS engine using multiple generic function...
[DEBU] ESPEAKTTSWrapper: Determining codec and sample rate...
[DEBU] ESPEAKTTSWrapper: Reading codec and sample rate from OUTPUT_AUDIO_FORMAT
[DEBU] ESPEAKTTSWrapper: Determining codec and sample rate... done
[DEBU] ESPEAKTTSWrapper:   codec:       pcm_s16le
-----> ESPEAKTTSWrapper:   sample rate: 22050
[DEBU] ESPEAKTTSWrapper: Examining fragment 0 (no cache)...
[DEBU] ESPEAKTTSWrapper: Language to voice code: 'eng' => 'en'
[DEBU] ESPEAKTTSWrapper: Calling helper function
[DEBU] ESPEAKTTSWrapper: Synthesizer helper called with output_file_path=None => creating temporary output file
[DEBU] ESPEAKTTSWrapper: Temporary output file path is '/tmp/tmp30di9k3w.wav'
[DEBU] ESPEAKTTSWrapper: TTS engine reads text from stdin
[DEBU] ESPEAKTTSWrapper: Creating arguments list...
[DEBU] ESPEAKTTSWrapper: Creating arguments list... done
[DEBU] ESPEAKTTSWrapper: Calling TTS engine...
[DEBU] ESPEAKTTSWrapper: Calling with arguments '['espeak', '-v', 'en', '-w', '/tmp/tmp30di9k3w.wav']'
[DEBU] ESPEAKTTSWrapper: Calling with text '1'
[DEBU] ESPEAKTTSWrapper: Passing text via stdin...
[DEBU] ESPEAKTTSWrapper: Passing text via stdin... done
[DEBU] ESPEAKTTSWrapper: TTS engine wrote audio data to file
[DEBU] ESPEAKTTSWrapper: Calling TTS ... done
[DEBU] ESPEAKTTSWrapper: Reading audio data...
[DEBU] AudioFile: Loading audio data...
[DEBU] AudioFile: self.file_format is None or not good => converting self.file_path
[DEBU] AudioFile: Temporary PCM16 mono WAVE file: '/tmp/tmp_ow6yas8.wav'
[DEBU] AudioFile: Converting audio file to mono...
-----> FFMPEGWrapper: Calling with arguments '['ffmpeg', '-i', '/tmp/tmp30di9k3w.wav', '-ac', '1', '-ar', '16000', '-y', '-map_metadata', '-1', '-flags', '+bitexact', '-f', 'wav', '/tmp/tmp_ow6yas8.wav']'
[DEBU] FFMPEGWrapper: Call completed
[DEBU] FFMPEGWrapper: Returning output file path '/tmp/tmp_ow6yas8.wav'
[DEBU] AudioFile: Converting audio file to mono... done
[DEBU] AudioFile: Deleted temporary audio file: '/tmp/tmp_ow6yas8.wav'
[DEBU] AudioFile: Sample length:  0.638
-----> AudioFile: Sample rate:    16000
[DEBU] AudioFile: Audio format:   pcm16
[DEBU] AudioFile: Audio channels: 1
[DEBU] AudioFile: Loading audio data... done

What happens is this:

  1. Since cew is disabled, the synthesized audio for each line is concatenated in Python code.
  2. A buffer is allocated to hold the concatenated audio, and its sample rate is set to 22050 (read from OUTPUT_AUDIO_FORMAT, per the log above).
  3. Before a file is loaded into the buffer, it is converted to mono using ffmpeg.
  4. Since the ffmpeg command specifies a sample rate of 16000, the samples loaded into the buffer do not have the expected sample rate of 22050.
  5. When the concatenated buffer is written to a file, the sample rate is set to 22050, causing the audio to appear sped up.
  6. As a consequence, all timestamps are out of sync.
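The arithmetic behind steps 4 and 5 can be sketched as follows. This is only an illustration, not aeneas code: it takes the 0.638 s sample length from the log above as the true duration of one fragment at 16000 Hz and computes how long the same samples appear to last when labeled as 22050 Hz. The function name `apparent_duration` is hypothetical.

```python
def apparent_duration(num_samples: int, labeled_rate: int) -> float:
    """Duration in seconds implied by a sample count and a declared rate."""
    return num_samples / labeled_rate


true_rate = 16000      # rate ffmpeg actually produced (-ar 16000)
labeled_rate = 22050   # rate assumed for the concatenation buffer
true_seconds = 0.638   # "Sample length" reported in the log

# Number of samples ffmpeg wrote for this fragment.
num_samples = round(true_seconds * true_rate)

# The same samples, interpreted at the wrong rate, seem shorter.
apparent_seconds = apparent_duration(num_samples, labeled_rate)

# Playback speed-up factor caused by the mismatch.
speedup = labeled_rate / true_rate
```

Under these assumptions every synthesized fragment shrinks by the same factor of 16000/22050, so the error in the timestamps grows with position in the audio instead of staying constant.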

I'd like to fix this, but beforehand I'd like to know why things are done this way. Evidently the true sample rate of the file is known once the data gets loaded, so is it ever necessary to set the sample rate beforehand?
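One possible direction, sketched with Python's standard `wave` module rather than aeneas's own classes: take the sample rate from the first loaded file and refuse to concatenate anything that disagrees, instead of fixing the rate in advance. The helpers `make_silent_wav` and `concatenate_wavs` are hypothetical names for this illustration, and the sketch assumes 16-bit mono PCM input.

```python
import io
import wave


def make_silent_wav(num_frames: int, sample_rate: int) -> bytes:
    """Build an in-memory 16-bit mono WAV of silence (test fixture only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * num_frames)
    return buf.getvalue()


def concatenate_wavs(wav_blobs):
    """Concatenate WAV blobs, deriving the sample rate from the data itself.

    Returns (sample_rate, raw_pcm_bytes). Raises ValueError if the inputs
    disagree on the sample rate or if there are no inputs.
    """
    rate = None
    chunks = []
    for index, blob in enumerate(wav_blobs):
        with wave.open(io.BytesIO(blob), "rb") as reader:
            file_rate = reader.getframerate()
            if rate is None:
                rate = file_rate  # the first file decides the rate
            elif file_rate != rate:
                raise ValueError(
                    f"fragment {index} has rate {file_rate}, expected {rate}"
                )
            chunks.append(reader.readframes(reader.getnframes()))
    if rate is None:
        raise ValueError("no audio fragments given")
    return rate, b"".join(chunks)


# Two fragments at 16000 Hz: 1.0 s and 0.5 s of silence.
rate, pcm = concatenate_wavs(
    [make_silent_wav(16000, 16000), make_silent_wav(8000, 16000)]
)
total_seconds = len(pcm) / 2 / rate  # 2 bytes per 16-bit mono frame
```

Whether aeneas can drop the preset rate entirely depends on why it was introduced, which is the open question in this report; the sketch only shows that the rate is recoverable from the loaded data.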

WalkaboutPianoMan commented 3 years ago

Hi Yorwba... Any chance you could tell me how to fix this bug? I'm working on a lot of audio/text syncing with 22 kHz files, so this fix would be a lifesaver for me :-) Peter