keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)

Confused about how to specify max mel_frames in the output spectrogram and training audio sample length in hparams.py #335

Open jjoe1 opened 4 years ago

jjoe1 commented 4 years ago

First, thanks for this detailed implementation of the original Tacotron model, and for the wiki.

I've been reading the wiki, this code, and the Tacotron paper (https://arxiv.org/pdf/1703.10135.pdf) for the last several days, but I'm still confused about something basic. As someone trying to learn text-to-speech models, I'm unclear about how a fixed-length spectrogram is generated for an input text during training.

  1. The longest ground-truth clip in the LJSpeech dataset is 14 sec; wouldn't that indirectly set the max mel_frames in the output to 14 × (1/0.0125) = 14 × 80 = 1120 (see the sketch after this list)? Also, what is the max_sentence_length of the input text after padding? I assume all input sentences used during training and inference are padded to a common max_len; is that correct?

  2. Another related issue, which may be a beginner question: after the encoder creates 256 hidden states (from 256 bidirectional LSTMs), isn't the decoder output limited to 256 frames (for output-layer reduction factor r=1)? If I understand encoder-decoder models correctly, and the decoder produces 1 frame per encoder state when r=1, how can it produce more frames than there are encoder states?
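
For concreteness, here is a minimal sketch of the frame-count arithmetic behind question 1, plus the reduction-factor relationship described in the paper. The variable names are hypothetical (not taken from hparams.py), and the 14 s clip length and 12.5 ms hop are the values assumed in the question:

```python
import math

# Assumed values from the question above, not from hparams.py.
clip_seconds = 14.0     # longest LJSpeech training clip (per the question)
frame_shift_ms = 12.5   # hop between successive mel frames (the 0.0125 s factor)

frames_per_second = 1000.0 / frame_shift_ms          # 80 frames per second
max_mel_frames = int(clip_seconds * frames_per_second)
print(max_mel_frames)   # 1120

# With output reduction factor r, the decoder emits r frames per step,
# so the number of decoder steps is ceil(max_mel_frames / r), e.g.:
r = 2
decoder_steps = math.ceil(max_mel_frames / r)        # 560
```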