syang1993 / gst-tacotron

A TensorFlow implementation of "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"

What would happen if we merged datasets? #5

Closed: fazlekarim closed this issue 6 years ago

fazlekarim commented 6 years ago

What would the results look like if we combined LJSpeech and Blizzard? Would we get better results?

syang1993 commented 6 years ago

@fazlekarim Since the tokens are learned in an unsupervised way, it's hard to interpret what each token means when training on multi-speaker data. I guess it's better to add a speaker embedding to the network, as in another Google paper.
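For concreteness, here is a minimal sketch of what that could look like in the TF 1.x style this repo uses. Everything here is hypothetical (the function name, embedding size, and where it plugs into the model), but the idea from the multi-speaker Tacotron line of work is simply to look up a learned per-speaker vector and broadcast-concatenate it onto the encoder outputs:

```python
import tensorflow as tf

def add_speaker_embedding(encoder_outputs, speaker_ids, num_speakers, embed_dim=64):
  """Concatenate a learned speaker embedding onto every encoder timestep.

  encoder_outputs: [batch, time, channels] from the Tacotron encoder.
  speaker_ids:     [batch] int32 speaker indices.
  """
  embedding_table = tf.get_variable(
      'speaker_embedding', [num_speakers, embed_dim],
      initializer=tf.truncated_normal_initializer(stddev=0.5))
  speaker_embed = tf.nn.embedding_lookup(embedding_table, speaker_ids)   # [batch, embed_dim]
  time_steps = tf.shape(encoder_outputs)[1]
  tiled = tf.tile(tf.expand_dims(speaker_embed, 1), [1, time_steps, 1])  # [batch, time, embed_dim]
  return tf.concat([encoder_outputs, tiled], axis=-1)
```

With the decoder attending over speaker-conditioned encoder states, the GSTs would be freer to model style rather than speaker identity.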

rryan commented 6 years ago

The GST paper covered this, too. Section 7.2:

Without using any metadata as labels, we train a baseline Tacotron and a 1024-token GST model for comparison. As expected, the baseline fails to learn, since the multi-speaker data is too varied. The GST model results are presented in Figure 7. This shows spectrograms for the same phrase overlaid with F0 tracks, generated by conditioning the model on two randomly chosen tokens. Examining the trained GSTs, we find that different tokens correspond to different speakers. This means that, to synthesize with a specific speaker's voice, we can simply feed audio from that speaker as a reference signal. See Section 7.3 for more quantitative evaluations.
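As a rough sketch of that inference-time flow: compute a mel spectrogram from a reference utterance by the target speaker and feed it to the GST reference encoder in place of the training target, so the style token attention picks out that speaker's token(s). The parameter values below are illustrative only; in practice you would reuse the repo's own audio preprocessing so the features match what the model saw during training:

```python
import numpy as np
import librosa

def reference_mel(wav_path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
  """Log-mel spectrogram of a reference utterance, shaped [time, n_mels]."""
  wav, _ = librosa.load(wav_path, sr=sr)
  mel = librosa.feature.melspectrogram(
      y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
  return np.log(mel.T + 1e-5)  # transpose to [time, n_mels]; epsilon avoids log(0)
```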

fazlekarim commented 6 years ago

@rryan With that in mind, can we exploit this and use it for voice conversion?