Closed ozdefir closed 8 years ago
You may want to know, I use mfcc_window_length=0.100 and mfcc_window_shift=0.025. With the default settings the deviation is smaller, but it's still there.
I think that the most likely explanation is that the audio file contains a tail ("End of Chapter Two of Book Six etc."), rather than rounding errors.
I see that you created an "END" fragment, but I guess it is not enough for the DTW algorithm to correctly "exclude" the tail.
To confirm this, I downloaded the MP3 and the XHTML file, and ran aeneas twice:
In the second case, I got a perfect sync map, as expected.
Of course, your observation still hints at the problem of reliably identifying the head/tail of audio files, which is not 100% resolved with the current approach.
EDIT: attaching a ZIP with the resulting JSON files, in case you want to check. karamazov.zip
I changed the issue title to better reflect the issue at hand.
The "END" fragments that you see on those pages aren't present in the syncmaps; they are added by JS for convenience.
The problem isn't about heads and tails. They are fixed separately before the alignment.
After some empirical work I noticed that the deviation depends on the window shift. When the shift is a multiple of 0.020 (as with the default value of 0.040) there is no deviation. But it peaks slightly before these values (e.g. 0.039), which suggests a modulo effect.
If I'm not misreading it, the reason is this line in the dtw module:
anchor_indices = numpy.array([int(a[0] / mws) for a in synt_anchors])
The user-entered window shift value is used directly. Instead, the value that corresponds to a whole number of samples at the frame rate (synth) should be used: actual_mws = int(mws*sample_rate)/sample_rate
Since espeak output has a sample rate of 22050 Hz by default, multiples of 0.020 correspond to an integer number of samples and aren't affected by truncation.
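A rough sketch of how the truncation shifts an anchor index, assuming the 22050 Hz rate and the 0.025 shift mentioned in this thread. The helper names and the 1800 s anchor are hypothetical and chosen only for illustration; exact fractions are used so that floating-point rounding does not blur the result.

```python
from fractions import Fraction

SAMPLE_RATE = 22050  # espeak default sample rate, as stated in the thread


def actual_shift(shift_ms, sample_rate=SAMPLE_RATE):
    """Return the window shift in seconds after truncation to whole samples."""
    # Integer division mirrors int(mws * sample_rate) without float error.
    samples_per_frame = (shift_ms * sample_rate) // 1000
    return Fraction(samples_per_frame, sample_rate)


def anchor_index(anchor_seconds, shift_seconds):
    """Frame index of an anchor time, truncated like the dtw line quoted above."""
    return int(Fraction(anchor_seconds) / shift_seconds)


nominal = Fraction(25, 1000)  # user-entered mfcc_window_shift = 0.025
actual = actual_shift(25)     # 551 whole samples per frame, i.e. 551/22050 s
anchor = 1800                 # hypothetical anchor, 30 minutes into the synthesized audio

naive_index = anchor_index(anchor, nominal)
fixed_index = anchor_index(anchor, actual)
print(naive_index, fixed_index, fixed_index - naive_index)  # 72000 72032 32
```

Under these assumptions the index computed from the nominal shift falls 32 frames short after 30 minutes, which is roughly 0.8 s at a 0.025 shift and grows with the anchor time.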
Hi,
I misunderstood the initial description, and having run my experiments with the default window length/shift parameters, I did not reproduce the issue. My bad.
Now I see the issue, and yes, it occurs when
mfcc_shift * sample_rate(tts)
is not an integer. Since sample_rate(espeak)=22050, this implies that the issue is avoided only when mfcc_shift is a multiple of 0.020 s.
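The integrality condition can be checked with a small sketch like the one below. It assumes shifts are given in whole milliseconds and uses a hypothetical helper, `lost_samples_per_frame`, that is not part of aeneas; it reports the fraction of a sample discarded per frame when the product is truncated.

```python
from fractions import Fraction


def lost_samples_per_frame(shift_ms, sample_rate):
    """Fraction of a sample dropped per frame when shift * rate is truncated."""
    return Fraction((shift_ms * sample_rate) % 1000, 1000)


ESPEAK_RATE = 22050  # sample rate quoted in the thread

# Shifts between 1 ms and 100 ms whose product with the rate is an integer.
safe_shifts_ms = [ms for ms in range(1, 101)
                  if lost_samples_per_frame(ms, ESPEAK_RATE) == 0]
print(safe_shifts_ms)  # [20, 40, 60, 80, 100]

for ms in (39, 40, 41):
    print(ms, lost_samples_per_frame(ms, ESPEAK_RATE))
```

The loss is 19/20 of a sample at 39 ms, zero at 40 ms and 1/20 at 41 ms, which matches the earlier observation that the deviation peaks slightly before each multiple of 0.020.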
Initially I thought this would simply inject a spurious shift of at most mfcc_shift into the timings since it offsets the anchor index by 1, but actually the story is more complex.
By comparing two runs, with window length/shift = (0.100, 0.040) and (0.100, 0.025) respectively, one can see that:
I tested (also on the Karamazov file) the fix you propose to dtw.py, and it seems to solve the issue, as one expects, since it brings the anchor point closer to where it should be.
Considering that aeneas now supports different TTS engines, I feel that the only elegant fix for this issue consists in adding a new step: downsampling the TTS output to 16000 Hz, as it will ensure that the above product is an integer for any value of mfcc_shift that is a multiple of 0.001 s.
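As a quick sanity check of this claim, the sketch below computes the smallest shift step (in whole milliseconds) for which the product of shift and sample rate is always an integer. The function name is hypothetical, and the check assumes shifts are specified with millisecond granularity.

```python
from math import gcd


def smallest_safe_shift_ms(sample_rate):
    """Smallest shift in ms whose multiples all give a whole number of samples."""
    # shift_ms * sample_rate / 1000 is an integer exactly when shift_ms is a
    # multiple of 1000 / gcd(sample_rate, 1000).
    return 1000 // gcd(sample_rate, 1000)


for rate in (22050, 16000):
    print(rate, smallest_safe_shift_ms(rate))
```

With these assumptions, 22050 Hz requires shifts that are multiples of 20 ms, while 16000 Hz accepts any multiple of 1 ms.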
BTW, some time ago I asked Reece Dunn if espeak(-ng) can support an option to synthesize speech at a user-specified sample rate, but it seems not doable easily: https://github.com/espeak-ng/espeak-ng/issues/88
To recap:
Thank you for reporting this issue!
Let me also note that in theory this issue might affect the word-level alignment when using multi-level files (currently, L3 shift = 0.005s). However, since in general each sentence is short, the effect should be minimal.
I decided to prioritize fixing this. I will take the chance to also address other TTS-related things like #87. This will imply that the next release will be 1.6.0 rather than 1.5.2, as I will need to slightly modify the aeneas TTS API.
@ozdefir it should be fixed by the code currently in the devel branch. I still need to implement the caching mechanism and another couple of things before releasing as v1.6.0, though.
@ozdefir Leaving open in case you want to check the code and confirm it is fixed.
Just checked it and it looks fine. There's no timestamp shift due to mfcc_shift setting.
On 09/19/2016 12:56 AM, Firat Özdemir wrote:
Just checked it and it looks fine. There's no timestamp shift due to mfcc_shift setting.
Great, thank you for taking time to check!
I will close this issue when releasing v1.6.0.
AP
On devel now, closing.
With longer audio files I observe a consistent negative bias which increases gradually towards the end. To make sure it's not a playback issue, I tested with Audacity, which confirmed the observation. Examples:
https://readiance.org/finetuneas/librivox/the-brothers-karamazov-by-fyodor-dostoyevsky/40-book-6-chapter-2-the-duel-the
https://readiance.org/finetuneas/librivox/childrens-short-works-vol-011-by-various/the-little-mermaid-childrens-short-works?g=s
The alignments are almost perfect, so I thought it could be due to floating point math or rounding.