fazlekarim closed this issue 6 years ago
@fazlekarim Since the tokens are learned in an unsupervised way, it's hard to tell what each token means when training on multiple speakers. I suspect it's better to add a speaker embedding to the network, as in another of Google's papers.
The GST paper covered this, too. Section 7.2:
> Without using any metadata as labels, we train a baseline Tacotron and a 1024-token GST model for comparison. As expected, the baseline fails to learn, since the multi-speaker data is too varied. The GST model results are presented in Figure 7. This shows spectrograms for the same phrase overlaid with F0 tracks, generated by conditioning the model on two randomly chosen tokens. Examining the trained GSTs, we find that different tokens correspond to different speakers. This means that, to synthesize with a specific speaker’s voice, we can simply feed audio from that speaker as a reference signal. See Section 7.3 for more quantitative evaluations.
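Mechanically, conditioning on a reference signal comes down to attention over the learned style tokens: the reference encoder's output acts as the query, and the resulting weighted sum of tokens is the style embedding fed to the synthesizer. A minimal numpy sketch of that attention step, with made-up dimensions and random stand-ins for the trained weights and reference encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration; the paper uses its own dimensions.
num_tokens, token_dim, ref_dim = 10, 256, 128

# Learned style tokens and query projection (randomly initialized here;
# in a real model these come from training).
tokens = rng.standard_normal((num_tokens, token_dim))
W_q = rng.standard_normal((ref_dim, token_dim))

def style_embedding(ref_encoding):
    """Attend over the style tokens using the reference encoding as query."""
    query = ref_encoding @ W_q                     # (token_dim,)
    scores = tokens @ query / np.sqrt(token_dim)   # one score per token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax attention weights
    return weights @ tokens                        # weighted sum -> style embedding

# To "pick a speaker", you would run that speaker's audio through the
# reference encoder; here a random vector stands in for that encoding.
ref = rng.standard_normal(ref_dim)
emb = style_embedding(ref)
print(emb.shape)  # (256,)
```

Because the token weights are produced by the reference audio alone, swapping in a different speaker's clip changes the style embedding without retraining, which is what makes the voice-conversion idea below plausible.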
@rryan With that in mind, could we exploit this for voice conversion?
What would the results look like if we combined LJSpeech and Blizzard? Would we get better results?