Hey, I am back again with another question :P

Can I interpret the two-stage training scheme as follows?

1. Training of the CTC-Attention phoneme recognizer, the speaker encoder, and the vocoder. These three can be trained separately, each on its own.
2. Training of the seq2seqMoL model, which needs the outputs of the CTC-Attention phoneme recognizer and the speaker encoder. Each training instance looks like (A's sentence_1, B's sentence_x, B's sentence_1), and the MSE loss is computed between the model's output for B's sentence_1 and the ground-truth B's sentence_1. (To check my understanding, I sketched this step in code below.)
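Here is a minimal PyTorch-style sketch of how I picture one stage-2 training step. All names here (`stage2_step`, `phoneme_recognizer`, `speaker_encoder`, `seq2seq_mol`, and the `mel_*` tensors) are hypothetical placeholders of mine, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

def stage2_step(phoneme_recognizer, speaker_encoder, seq2seq_mol, optimizer,
                mel_a1, mel_bx, mel_b1):
    # mel_a1: mel-spectrogram of A's sentence_1 (source content)
    # mel_bx: mel-spectrogram of B's sentence_x (speaker reference)
    # mel_b1: mel-spectrogram of B's sentence_1 (ground truth)

    # Stage-1 models are assumed frozen here, used only as feature extractors.
    with torch.no_grad():
        ppg = phoneme_recognizer(mel_a1)      # phonetic content from A's sentence_1
        spk_emb = speaker_encoder(mel_bx)     # speaker identity from B's sentence_x

    pred_mel = seq2seq_mol(ppg, spk_emb)      # model's guess at B's sentence_1

    loss = F.mse_loss(pred_mel, mel_b1)       # MSE against ground-truth B's sentence_1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Is that roughly what happens in stage 2, or does the training pair up utterances differently?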
Please correct me if I am wrong.
Thanks!