syang1993 / gst-tacotron

A TensorFlow implementation of "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"

poor alignment when conditioned on reference audios #20

Open mohsinjuni opened 5 years ago

mohsinjuni commented 5 years ago

First of all, thanks very much for taking the time to implement this. I have listened to the audio samples here and the results are amazing. However, I am unable to replicate this behavior. Could you please help?

I have trained gst-tacotron for 200K steps on LJSpeech-1.1 with the default hyperparameters (samples attached in SampleAudios.zip). The encoder and decoder align well during training. However, at inference, when conditioned on an unseen reference audio (I used the 2nd target reference audio from here), the alignment does not hold.

The following is from training step 200000:

step-200000-align

However, when I evaluate the 203K checkpoint conditioned on the reference audio discussed above, I get the following.

eval-203000_ref-sample2-align

Without conditioning (i.e., with random style token weights):

eval-203000_ref-randomweight-align

Even the style transfer in the voice does not make much of a difference.
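For context, here is how I understand the two conditioning modes above (a minimal NumPy sketch with illustrative shapes, not this repo's actual code, which uses multi-head attention): the reference encoder summarizes the reference mel spectrogram into a query vector, attention over the learned token bank yields the combination weights, and the "random" case simply samples those weights instead of inferring them.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

num_tokens, token_dim, query_dim = 10, 256, 128   # illustrative sizes

tokens = np.random.randn(num_tokens, token_dim)   # learned GST bank
ref_query = np.random.randn(query_dim)            # reference-encoder summary
proj = np.random.randn(query_dim, num_tokens)     # attention projection

# Conditioned on a reference audio: weights come from attention.
attn_weights = softmax(ref_query @ proj)          # shape: (num_tokens,)
style_embedding = attn_weights @ tokens           # shape: (token_dim,)

# Without a reference ("random"): weights are sampled instead of inferred.
rand_weights = softmax(np.random.randn(num_tokens))
style_embedding_rand = rand_weights @ tokens
```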

Please find the attached zip file with the voice samples.

My Questions:

syang1993 commented 5 years ago

Hi, I guess there is a mismatch between your reference audio and your training audio. For my demo page, I trained the model with the Blizzard2011 data. Before training, I randomly selected 500 sentences as a test set, which is used to provide the reference audio. So in your experiment the reference audio is from the Blizzard2011 database, but your model was trained on LJSpeech data.
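On top of that, it is worth checking that the reference wav is loaded at the sample rate the model was trained with. LJSpeech is 22050 Hz, while other corpora may use a different rate. A quick sketch of such a check (not code from this repo; librosa is assumed, and the helper name and path are hypothetical):

```python
import librosa

def load_reference(path, target_sr=22050):  # 22050 Hz matches LJSpeech
    wav, sr = librosa.load(path, sr=None)   # keep the file's native rate
    if sr != target_sr:
        print(f"Resampling reference from {sr} Hz to {target_sr} Hz")
        wav = librosa.resample(wav, orig_sr=sr, target_sr=target_sr)
    return wav

ref_wav = load_reference("ref.wav")  # hypothetical path
```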

I'm sorry, I'm an intern at Tencent now, so I don't have the pre-trained model anymore.

mohsinjuni commented 5 years ago

Hi, thanks for your quick response. I was under the impression that I could use any reference audio (as a style) and use the model to generate new speech in that reference's style. Does it matter which reference I use? Does it have to be from the same distribution as the training data? My assumption was that the model learns the training-data distribution automatically and generates new audio/wav files in whatever style the reference provides. Please correct me if I am wrong. Thanks again for your help.
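To quantify how far my reference is from the training data, I am thinking of comparing mel-spectrogram statistics between the reference clip and a training clip, along these lines (a rough diagnostic sketch of my own, not from this repo; librosa is assumed and the file names are just examples):

```python
import librosa
import numpy as np

def mel_stats(path, sr=22050, n_mels=80):
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel)
    return mel_db.mean(axis=1), mel_db.std(axis=1)

train_mean, train_std = mel_stats("LJ001-0001.wav")  # a training clip
ref_mean, ref_std = mel_stats("ref.wav")             # the reference clip

# Large per-band gaps hint at the domain mismatch described above.
print("mean gap:", np.abs(train_mean - ref_mean).mean())
print("std gap: ", np.abs(train_std - ref_std).mean())
```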

liangshuang1993 commented 5 years ago

@mohsinjuni Hi, I'm in the same situation as you. Have you figured out whether the reference audio can be any audio or must come from the same dataset? Thanks.

shrinidhin commented 4 years ago

Hi, I have a similar question. Has anyone found a solution for this?