Hi @MorganCZY !
First, we annotated the pitch and text per mel frame over time, so the text, mel, and pitch inputs are all aligned. Also, when we downsample the mel spectrogram by a factor of 4, the text and pitch inputs are processed at the same time resolution. Therefore, the lengths of the text, mel, and pitch inputs are all the same.
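As a rough sketch of this alignment step (hypothetical helper names, and a simple frame-selection reduction; the actual preprocessing in this repo may differ):

```python
import numpy as np

REDUCTION = 4  # time-axis reduction factor discussed above

def align_and_reduce(mel, text_frames, pitch_frames):
    """mel: (T, n_mels); text_frames, pitch_frames: per-frame annotations of
    length T. All three are reduced along time by the same factor, so their
    lengths stay identical after downsampling. (Frame selection is used here
    only for illustration.)"""
    T = (mel.shape[0] // REDUCTION) * REDUCTION   # drop any trailing frames
    mel_q   = mel[:T:REDUCTION]                   # (T // 4, n_mels)
    text_q  = text_frames[:T:REDUCTION]           # (T // 4,)
    pitch_q = pitch_frames[:T:REDUCTION]          # (T // 4,)
    assert len(mel_q) == len(text_q) == len(pitch_q)
    return mel_q, text_q, pitch_q

# e.g. an 800-frame mel with per-frame annotations -> 200 frames each
mel_q, text_q, pitch_q = align_and_reduce(
    np.zeros((800, 80)), np.zeros(800, dtype=int), np.zeros(800, dtype=int))
```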
The attention module was designed with the expectation that it would catch the small errors in our annotations, and we also wanted to keep the structure of DCTTS (https://arxiv.org/pdf/1710.08969.pdf), which was our reference model. So, if the model is well trained, the attention result should be almost diagonal.
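For context, DCTTS formalizes this expectation with a guided attention loss that penalizes attention mass far from the diagonal. A minimal sketch of that penalty is below (this is the DCTTS formulation, not necessarily the exact code used in this repo):

```python
import torch

def guided_attention_weight(N, T, g=0.2):
    """DCTTS-style diagonal penalty: entries far from the n/N == t/T
    diagonal get a weight near 1, entries on the diagonal near 0."""
    n = torch.arange(N).float().unsqueeze(1) / N   # (N, 1)
    t = torch.arange(T).float().unsqueeze(0) / T   # (1, T)
    return 1.0 - torch.exp(-((n - t) ** 2) / (2 * g * g))  # (N, T)

def guided_attention_loss(attn):
    """attn: (N_text, T_frames) attention matrix. The loss is small only
    when the attention mass lies close to the diagonal."""
    W = guided_attention_weight(*attn.shape).to(attn.device)
    return (attn * W).mean()
```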
Unlike TTS, in singing synthesis it is common to provide duration information together with the text input, so during training we feed the text using the per-token durations obtained from the score. For more details, please refer to the input representation method described in the paper.
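As a sketch (hypothetical helper, not the exact code in this repo), expanding each text token by its score duration so it lines up with the mel frames can look like:

```python
import torch

def expand_text_by_duration(text_ids, durations_in_frames):
    """Repeat each text token for the number of mel frames the score assigns
    to it, so the expanded text sequence matches the mel's time axis.
    text_ids: (L,) token ids; durations_in_frames: (L,) frames per token."""
    return torch.repeat_interleave(text_ids, durations_in_frames)

# Example: first token held for 3 frames, second for 1, third for 4
text_ids  = torch.tensor([7, 12, 5])
durations = torch.tensor([3, 1, 4])
frame_level_text = expand_text_by_duration(text_ids, durations)
# -> tensor([7, 7, 7, 12, 5, 5, 5, 5]); length == durations.sum()
```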
The lengths of M, E_MQ, E'_M, D_M, and M^ are all 1/4 of the length of S. Also, T and P have the same length.
Got it! BTW, have you tried using the full time resolution for this training? Can it still succeed?
We've done rough experiments on modeling waveforms with a higher sampling rate (for example, generating a 44kHz sound source instead of 22kHz, which means modeling a longer spectrogram), and that didn't work right away. To experiment at full time resolution, factors such as the receptive field would probably need to be considered carefully.
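For example, the receptive field of a stack of dilated 1D convolutions (as in DCTTS-style networks) covers a fixed number of frames, so the same network spans less audio time when the frame rate increases. A rough way to reason about it (illustrative numbers, not this repo's exact configuration):

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in frames) of a stack of stride-1 1D convolutions:
    rf = 1 + sum((k - 1) * d) over the layers."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

# Example: 8 layers, kernel 3, dilations 1, 3, 9, 27 repeated twice
layers = [(3, d) for d in (1, 3, 9, 27)] * 2
rf_frames = receptive_field(*zip(*layers))   # 161 frames
# At full (undownsampled) resolution, those 161 frames cover 4x less
# audio time than at the reduced resolution.
```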
The mel-spectrograms are downsampled to a quarter of their length, as mentioned in your paper. May I ask, then, how this model learns the text-mel alignment through the attention module, and how the mel decoder makes use of the pitch condition? Intuitively, I think there would be a disordered relationship between mel and pitch along the time axis, which might even affect the text-mel alignment. In this picture, are the time dimensions of M, E_MQ, E'_M, D_M, and M^ all 1/4 of the whole length?