First, thanks for this detailed implementation of the original Tacotron model and for the wiki.
I've been trying to read the wiki, this code, and the Tacotron paper (https://arxiv.org/pdf/1703.10135.pdf) for the last several days, but I'm confused about something basic. As someone trying to learn text-to-speech models, I'm unclear about how a fixed-length spectrogram is generated for an input text during training.
The max ground-truth clip length in the LJSpeech dataset is about 14 s, so wouldn't that indirectly define the max mel_frames in the output to be 14 × (1/0.0125) = 1120? Also, what is the max_sentence_length of the input text after padding? I assume all input sentences used during training and inference are padded to a max_len; is that correct?
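To double-check my arithmetic above (assuming the 12.5 ms frame shift from the paper; the variable names here are mine, not from this repo):

```python
# Sanity check of the max-frame arithmetic, assuming a 12.5 ms frame shift.
max_clip_sec = 14.0       # approx. longest LJSpeech clip
frame_shift_sec = 0.0125  # 12.5 ms hop, as in the Tacotron paper

max_mel_frames = int(max_clip_sec / frame_shift_sec)
print(max_mel_frames)  # 1120
```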
Another related issue, which may be a beginner question: after the encoder creates 256 hidden states (from the 256 bidirectional LSTMs), isn't the decoder output limited to 256 frames (for output-layer reduction factor r=1)? If I understand encoder-decoder models correctly, and the decoder produces 1 frame per encoder state with r=1, then how can it produce more frames than there are encoder states?
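For context, here is a toy NumPy sketch of how I currently picture the decoder loop; all names and sizes are my own for illustration, not from this repo. My understanding is that each decoder step attends over all encoder states rather than consuming one state per frame, so the number of decoder steps need not match the number of encoder states, which is where my confusion comes in:

```python
import numpy as np

# Toy sketch: T_enc encoder states, but the decoder runs for T_dec steps.
# Each step forms a context vector by attending over ALL encoder states,
# so T_dec is not tied to T_enc.
T_enc, T_dec, dim = 50, 200, 256
rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((T_enc, dim))

frames = []
query = np.zeros(dim)
for t in range(T_dec):                        # T_dec can exceed T_enc
    scores = encoder_states @ query           # dot-product attention scores
    weights = np.exp(scores - scores.max())   # softmax over encoder positions
    weights /= weights.sum()
    context = weights @ encoder_states        # weighted sum of encoder states
    query = np.tanh(context)                  # stand-in for the decoder RNN step
    frames.append(query)

print(len(frames))  # 200 frames from only 50 encoder states
```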