syang1993 / gst-tacotron

A TensorFlow implementation of "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"

How to integrate this code to r9y9's wavenet_vocoder ? #14

Open rishikksh20 opened 6 years ago

rishikksh20 commented 6 years ago

Is there any way to integrate this code with the WaveNet vocoder?

syang1993 commented 6 years ago

Hi, the main part of this model is similar to Tacotron 1. We can also add the style embedding part to Tacotron 2, then integrate it with WaveNet to get better results.
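
As a rough illustration of what "adding the style embedding part" could involve, here is a minimal NumPy sketch, not the code from this repo. It assumes a reference embedding has already been computed from the reference audio, uses a single-head dot-product attention over a bank of tokens (the paper uses multi-head attention), and concatenates the resulting style embedding onto every encoder time step. The function names `style_embedding` and `condition_encoder_outputs` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def style_embedding(ref_embedding, tokens):
    """Attend over style tokens with the reference embedding as the query.

    ref_embedding: (batch, d) output of a reference encoder (assumed given).
    tokens:        (num_tokens, d) learnable style token bank.
    Returns the style embedding (batch, d) and attention weights (batch, num_tokens).
    """
    keys = np.tanh(tokens)  # tanh on the tokens before attention, as described in the GST paper
    scores = ref_embedding @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ keys, weights


def condition_encoder_outputs(encoder_outputs, style):
    """Tile the style embedding over time and concatenate it to the text encoder outputs.

    encoder_outputs: (batch, T, e); style: (batch, d) -> (batch, T, e + d).
    Concatenation is an assumption of this sketch; adding is another option.
    """
    num_steps = encoder_outputs.shape[1]
    tiled = np.repeat(style[:, None, :], num_steps, axis=1)
    return np.concatenate([encoder_outputs, tiled], axis=-1)
```

The conditioned encoder outputs would then feed the attention and decoder of whichever Tacotron variant is used, and the predicted mel spectrograms would go to the vocoder.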

rishikksh20 commented 6 years ago

OK, I will modify the r9y9/Tacotron-2 code, add your style embedding code to it, and then see how it works.

rishikksh20 commented 6 years ago

@syang1993 What about the loss function?

syang1993 commented 6 years ago

@rishikksh20 The style tokens are trained in an unsupervised way, so I guess we don't need an extra loss unless you have a specific purpose.
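
To make the "no extra loss" point concrete, here is a minimal sketch of a spectrogram reconstruction loss with no style-specific term. It assumes an L1 loss on mel (and optionally linear) spectrograms, as in Tacotron 1-style training; the function name `reconstruction_loss` is hypothetical and the exact loss in any given codebase may differ. The style tokens would receive gradients only through this reconstruction objective.

```python
import numpy as np


def reconstruction_loss(mel_pred, mel_target, linear_pred=None, linear_target=None):
    """L1 reconstruction loss on spectrograms (an assumption of this sketch).

    There is no term that supervises the style tokens directly: they are
    learned only because they help reconstruct the target spectrogram.
    """
    loss = np.mean(np.abs(mel_pred - mel_target))
    if linear_pred is not None and linear_target is not None:
        # Optional linear-spectrogram term, for models that predict one.
        loss += np.mean(np.abs(linear_pred - linear_target))
    return loss
```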

rishikksh20 commented 6 years ago

@syang1993 Thanks! Is it possible to integrate Tacotron 1 with the WaveNet vocoder? The GST Tacotron paper mentions that they tested it with WaveNet, so I think it should be possible.

syang1993 commented 6 years ago

@rishikksh20 I tried to integrate Tacotron 1 with WaveNet, but the performance was worse than with Tacotron 2. Though the paper tested it with WaveNet, I guess it's easier to do it with Tacotron 2.

rishikksh20 commented 6 years ago

@syang1993 OK, got it. The issue with Tacotron 1 might be due to the receptive field width. Regarding Tacotron 2, is just adding the style embedding part of your code enough? I ask because the GST paper also mentions some changes to the decoder. (I could check this myself, but training Tacotron 2 takes at least a week.)

Sorry to ask so many questions, but this is an urgent task for me and I have limited compute.

syang1993 commented 6 years ago

@rishikksh20 I'm not sure what "receptive field width" means in Tacotron 1. I have trained Tacotron 2 before, and I don't think it takes that long. But if you add the style embedding and reference encoder to Tacotron 2, training will take more time. Also, the decoder in this repo does not exactly match the paper; I just tried the style embedding idea to see how it works. I guess you may not need to reproduce the paper's structure exactly; you can modify it for your own purpose.

karamarieliu commented 6 years ago

Has anyone tried GST with Tacotron 2 and WaveNet? I am working on it now but don't have results yet, so this could all be for naught.

rishikksh20 commented 6 years ago

@karamarieliu Could you share your work with me? I am also working on this issue.

karamarieliu commented 6 years ago

@rishikksh20 I am currently encountering an evaluation error, so I'll post it when that is solved. Right now I have Tacotron 1 with GST and WaveNet, if you want that. I'm still testing it, but it runs okay. https://github.com/karamarieliu/gst-tacotron-wavenet

rishikksh20 commented 6 years ago

@karamarieliu Do you mean GST-Tacotron 1 with wavenet_vocoder is running fine? Do you have any voice samples from it? I tried to integrate gst-tacotron (based on Tacotron 1) with the WaveNet vocoder, but it didn't perform well. If you have any voice sample where the spectrogram is generated by gst-tacotron and the audio is synthesized by wavenet_vocoder, please share it with me. Also, the repo you mentioned doesn't explain how to use wavenet_vocoder with gst-tacotron.

rishikksh20 commented 6 years ago

@karamarieliu Can you share here how to train WaveNet (with GST Tacotron 1) and how to synthesize audio? I followed the ... and figured out how to synthesize, but it would be better if you could elaborate a bit when you have some time. Otherwise, please share the command to train WaveNet if possible.