Hi, I have some questions:
Do the speaker encoders of the base TTS models and the tone color converter model share the same model structure? Is there any connection between the base TTS models and the tone color converter model?
During training, for a text-audio pair <x, y>, are the reference speaker audio, the output of the tone color converter model (speech with the reference tone color and controlled styles), and g for both the flow and the reverse flow all taken from y?
Do you plan to release the training code? We still have not been able to train a good model by following your paper.
Thanks a lot.