syang1993 / gst-tacotron

A tensorflow implementation of the "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"

Sample Alignment Graph #10

Open fazlekarim opened 6 years ago

fazlekarim commented 6 years ago

Hi,

Can you share the alignment graphs that you are obtaining for your audio samples? For most of my alignments, the y-axis is about half of the x-axis. Is there a reason why this is happening? In Keithito's repo, the shared alignment graphs have a 1-1 scale. In other words, the range of the x-axis and the y-axis is the same.

syang1993 commented 6 years ago

@fazlekarim You can find them in the demo page dir: https://github.com/syang1993/syang1993.github.io/tree/master/gst-tacotron/style-samples

In keithito's Tacotron, reduce_factor is 5, in which case the character axis and the frame axis of the alignment have similar lengths. In this repo, reduce_factor is 2, so the mel-spectrogram axis is about 2 times longer than the text axis.

zyj008 commented 6 years ago

@fazlekarim I have the same problem as you: the y-axis is about half of the x-axis, or even more. How did you solve the problem?

abuvaneswari commented 6 years ago

@syang1993, in my case, all the alignment graphs generated at each checkpoint (every 1000 steps) turn out the way @zyj008 described. I attach a sample png:

[Attached image: gst-step-1147000-align]

If I use regular Tacotron from keithito, the range of both axes turns out to be right about the same.

Do you have an explanation?

syang1993 commented 6 years ago

@abuvaneswari Hi, as I described above, the x-axis is the length of the mel-spectrogram (in decoder steps) and the y-axis is the number of characters. The alignment path (attention matrix) only shows the weights between each character and each frame. In your attached image, there are about 70 characters, and the corresponding audio has about 250 frames. I use reduce_factor=2, so the x-axis length is about 125 (250 / 2). If you use reduce_factor=5, as in Keithito's repo, the x-axis length is about 50 (250 / 5), which is close to the number of characters.
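
The arithmetic above can be checked with a rough sketch like the one below. The function name `decoder_steps` is hypothetical and is not taken from this repo or from Keithito's. It only assumes that each decoder step emits reduce_factor mel frames, so the x-axis length is the frame count divided by reduce_factor (rounded up).

```python
import math


def decoder_steps(num_frames: int, reduce_factor: int) -> int:
    """Approximate x-axis length of the alignment plot.

    Assumption: each decoder step emits `reduce_factor` mel frames,
    so the number of steps is the frame count divided by that factor.
    """
    return math.ceil(num_frames / reduce_factor)


# Numbers quoted in the comment above: about 70 characters, about 250 frames.
num_chars = 70
num_frames = 250

for r in (2, 5):
    steps = decoder_steps(num_frames, r)
    print(f"reduce_factor={r}: x-axis ~{steps} steps, y-axis ~{num_chars} characters")
```

With reduce_factor=2 the x-axis is roughly twice the y-axis, which matches the plots reported in this thread; with reduce_factor=5 the two axes come out at a similar scale.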