DTW anchor indexing problem due to non-integer TTS sample rate * shift (was: Systematic negative bias observable in longer audios)

ozdefir commented 8 years ago

With longer audio files I observe a consistent negative bias in the timings, which increases gradually towards the end. To make sure it's not a playback issue, I tested with Audacity, which confirmed the observation. Examples:

https://readiance.org/finetuneas/librivox/the-brothers-karamazov-by-fyodor-dostoyevsky/40-book-6-chapter-2-the-duel-the
https://readiance.org/finetuneas/librivox/childrens-short-works-vol-011-by-various/the-little-mermaid-childrens-short-works?g=s

The alignments are almost perfect, so I thought it could be due to floating point math or rounding.

ozdefir commented 8 years ago

You may want to know that I use mfcc_window_length=0.100 and mfcc_window_shift=0.025. With the default settings the deviation is smaller, but it's still there.

readbeyond commented 8 years ago

I think that the most likely explanation is that the audio file contains a tail ("End of Chapter Two of Book Six etc."), rather than rounding errors.

I see that you created an "END" fragment, but I guess it is not enough for the DTW algorithm to correctly "exclude" the tail.

To confirm this, I downloaded the MP3 and the XHTML file, and ran aeneas twice:

without any head/tail parameter;
then, specifying the audio/tail duration.

In the second case, I got a perfect sync map, as expected.

Of course, your observation still hints at the problem of reliably identifying the head/tail of audio files, which is not 100% resolved with the current approach.

EDIT: attaching a ZIP with the resulting JSON files, in case you want to check. karamazov.zip

readbeyond commented 8 years ago

I changed the issue title to better reflect the issue at hand.

ozdefir commented 8 years ago

The "END" fragments that you see on those pages aren't present in the syncmaps, they are added by JS for convenience.

The problem isn't about heads and tails. They are fixed separately before the alignment.

After some empirical work I noticed that the deviation depends on the window shift. When the shift is a multiple of 0.020 (as with the default value of 0.040) there is no deviation. The deviation peaks slightly below these values (e.g. at 0.039), which suggests a modulo effect.

If I'm not misreading it, the reason is this line in the dtw module:

anchor_indices = numpy.array([int(a[0] / mws) for a in synt_anchors])

The user-entered window shift value is used directly. Instead, the shift that corresponds to a whole number of samples of the synthesized audio should be used: actual_mws = int(mws * sample_rate) / sample_rate (using float division).

Since espeak output has a sample rate of 22050 Hz by default, shifts that are multiples of 0.020 correspond to a whole number of samples and aren't affected by truncation.
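To illustrate the truncation effect, here is a minimal sketch, not the actual aeneas code. The helper names effective_shift and anchor_index are hypothetical and the anchor times are made up; only the int(a / mws) indexing mirrors the line quoted above.

```python
from fractions import Fraction

SAMPLE_RATE = 22050  # espeak default, as stated in this thread


def effective_shift(nominal_shift, sample_rate):
    """Shift in seconds once the hop is truncated to a whole number of samples.

    Hypothetical helper: it mirrors int(mws * sample_rate) / sample_rate,
    but multiplies exactly to avoid floating point surprises.
    """
    exact_samples = Fraction(str(nominal_shift)) * sample_rate
    whole_samples = int(exact_samples)  # truncation toward zero
    return whole_samples / sample_rate


def anchor_index(anchor_time, shift):
    """Frame index of an anchor time, as in int(a[0] / mws)."""
    return int(anchor_time / shift)


nominal = 0.025
actual = effective_shift(nominal, SAMPLE_RATE)  # 551 samples instead of 551.25

# Made-up anchor times (seconds) in the synthesized audio.
for t in (60.0, 600.0, 1800.0):
    naive = anchor_index(t, nominal)
    fixed = anchor_index(t, actual)
    print(t, naive, fixed, fixed - naive)
```

Under these assumptions the gap between the two indices grows with the anchor time, reaching on the order of 30 frames (a bit under one second) at the 30 minute mark, while a shift of 0.040 leaves both indices identical.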

readbeyond commented 8 years ago

Hi,

I misunderstood the initial description, and having run my experiments with the default window length/shift parameters, I did not reproduce the issue. My bad.

Now I see the issue, and yes, it occurs when

mfcc_shift * sample_rate(tts)

is not an integer. Since sample_rate(espeak)=22050, this means the issue is avoided when mfcc_shift is a multiple of 0.020s.

Initially I thought this would simply inject a spurious shift of at most mfcc_shift into the timings since it offsets the anchor index by 1, but actually the story is more complex.

By comparing two runs, with window length/shift = (0.100, 0.040) and (0.100, 0.025) respectively, one can see that:

my initial thought of a simple 0 or mfcc_shift delay is clearly wrong;
it is not a simple accumulation effect either. In fact, computing the sequence of differences for the fragment begin or end times between 0.040 and 0.025 output, the differences are not monotonically increasing. This is because while the shift in the anchor time is monotonically increasing, there is the DTW algorithm running in cascade after it, and its output is not "monotonically related" to the anchor points.

I tested (also on the Karamazov file) the fix you propose to dtw.py, and it seems to solve the issue, as one expects, since it brings the anchor point closer to where it should be.

Considering that aeneas now supports different TTS engines, I feel that the only elegant fix for this issue consists in adding a new step: downsampling the TTS output to 16000 Hz, as it will ensure that the above product is an integer for any mfcc_shift that is a multiple of 0.001s.
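The integer condition can be checked with exact arithmetic. The sketch below is only an illustration: the function names samples_per_shift and safe_shifts are hypothetical, and shifts are restricted to whole milliseconds up to 100 ms as an assumption.

```python
from fractions import Fraction


def samples_per_shift(shift_ms, sample_rate):
    """Exact number of samples covered by a shift given in whole milliseconds."""
    return Fraction(shift_ms, 1000) * sample_rate


def safe_shifts(sample_rate, max_ms=100):
    """Shifts (in ms) for which shift * sample_rate is an integer."""
    return [
        ms
        for ms in range(1, max_ms + 1)
        if samples_per_shift(ms, sample_rate).denominator == 1
    ]


# At 22050 Hz only multiples of 20 ms qualify.
print(safe_shifts(22050))

# At 16000 Hz every whole-millisecond shift qualifies.
print(len(safe_shifts(16000)))
```

At 16000 Hz one millisecond is exactly 16 samples, which is why every shift that is a multiple of 0.001s passes the check.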

BTW, some time ago I asked Reece Dunn if espeak(-ng) can support an option to synthesize speech at a user-specified sample rate, but it seems not doable easily: https://github.com/espeak-ng/espeak-ng/issues/88

To recap:

for now, either run long files with a shift multiple of 0.020s, or patch dtw.py as suggested above;
for the future, I will implement a new step, downsampling the TTS output (espeak or other) to 16000 Hz.

Thank you for reporting this issue!

AP

karamazov2.zip

readbeyond commented 8 years ago

Let me also note that in theory this issue might affect the word-level alignment when using multi-level files (currently, L3 shift = 0.005s). However, since in general each sentence is short, the effect should be minimal.
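As a rough, back-of-the-envelope estimate of "minimal", one can compute the drift accumulated over a stretch of audio when the nominal shift is used instead of the sample-truncated one. This is only a sketch: drift_seconds is a hypothetical helper, the 10 second sentence length is an assumption, and the estimate ignores whatever the DTW step does afterwards.

```python
from fractions import Fraction


def drift_seconds(elapsed, shift_ms, sample_rate):
    """Approximate anchor drift after `elapsed` seconds of synthesized audio.

    The nominal shift covers `exact` samples, but only `truncated` whole
    samples are actually used, so time is mis-scaled by truncated / exact.
    """
    exact = Fraction(shift_ms, 1000) * sample_rate      # 110.25 at 22050 Hz, 5 ms
    truncated = exact.numerator // exact.denominator    # 110
    return float(elapsed * (1 - Fraction(truncated) / exact))


# Word-level shift of 0.005s on an assumed 10 second sentence.
print(drift_seconds(10, 5, 22050))

# Same shift if it were applied to a 30 minute chapter.
print(drift_seconds(1800, 5, 22050))

# After downsampling to 16000 Hz the product is an integer, so no drift.
print(drift_seconds(10, 5, 16000))
```

Under these assumptions the drift stays in the range of a few hundredths of a second for a single sentence, but would reach several seconds over a whole chapter.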

pettarin commented 8 years ago

I decided to prioritize fixing this. I will take the chance to also address other TTS-related things like #87. This implies that the next release will be 1.6.0 rather than 1.5.2, as I will need to slightly modify the aeneas TTS API.

readbeyond commented 8 years ago

@ozdefir it should be fixed by the code currently in the devel branch. I still need to implement the caching mechanism and another couple of things before releasing as v1.6.0, though.

readbeyond commented 8 years ago

@ozdefir Leaving open in case you want to check the code and confirm it is fixed.

ozdefir commented 8 years ago

Just checked it and it looks fine. There's no timestamp shift due to mfcc_shift setting.

readbeyond commented 8 years ago

On 09/19/2016 12:56 AM, Firat Özdemir wrote:

Just checked it and it looks fine. There's no timestamp shift due to mfcc_shift setting.

Great, thank you for taking time to check!

I will close this issue when releasing v1.6.0.

AP

readbeyond commented 8 years ago

On devel now, closing.

readbeyond / aeneas

DTW anchor indexing problem due to non-integer TTS sample rate * shift (was: Systematic negative bias observable in longer audios) #102