syang1993 / gst-tacotron

A TensorFlow implementation of "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"
368 stars, 110 forks

Training with custom data #23

Open wanshun123 opened 5 years ago

wanshun123 commented 5 years ago

Curious whether others have achieved reasonable results training on custom data. I tried training the model on data from https://github.com/aomv/voiceloop-in-the-wild-experiments/tree/master/data/donald-trump/data (audio clips of a few seconds each with transcriptions, somewhere around a couple of hours in total), with a metadata.csv file made in the same format as the LJSpeech dataset.
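
For anyone trying the same thing, here is a minimal sketch of building an LJSpeech-style metadata.csv. LJSpeech uses one line per clip with three pipe-separated fields: the clip id (the wav file name without extension), the transcription, and a normalized transcription. The helper names `format_metadata_line` and `write_metadata` are hypothetical, and repeating the raw transcript in the third field is an assumption for data that has no separate normalized text. Check which field the repo's preprocessing actually reads before relying on this.

```python
from pathlib import Path


def format_metadata_line(clip_id, transcript):
    """Return one LJSpeech-style line: id|transcription|normalized transcription."""
    # Collapse runs of whitespace so a transcript never spans several lines.
    text = " ".join(transcript.split())
    # The pipe is the field separator, so it must not appear inside a field.
    if "|" in clip_id or "|" in text:
        raise ValueError(f"pipe character found in entry {clip_id!r}")
    # Assumption: no separate normalized text exists, so the transcript is repeated.
    return f"{clip_id}|{text}|{text}"


def write_metadata(pairs, out_path):
    """Write (clip_id, transcript) pairs to out_path and return the line count."""
    lines = [format_metadata_line(clip_id, text) for clip_id, text in pairs]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

A call such as `write_metadata([("clip_0001", "Hello world.")], "metadata.csv")` would produce the single line `clip_0001|Hello world.|Hello world.`.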

I've trained for several hours and the loss decreases steadily, but the alignment graphs below suggest the model is not learning the attention properly. I've also failed to generate intelligible audio, at least without using a reference audio (I tried several times).
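
A rough way to put numbers on what an alignment plot shows is sketched below. It is only a heuristic, not something the repo provides: the function name `alignment_stats` is hypothetical, and it assumes the attention weights are available as a list of rows, one row per decoder step, each row holding the weights over encoder steps. A healthy alignment tends to be sharply peaked (high focus), to visit most of the input (high coverage), and to move forward through the text (high monotonic score). A collapsed one usually scores low on focus or coverage.

```python
def alignment_stats(alignment):
    """Summarize an attention matrix given as rows of decoder-step weights."""
    if not alignment or not alignment[0]:
        raise ValueError("alignment must be a non-empty 2-D list")
    n_encoder = len(alignment[0])
    # Encoder position receiving the largest weight at each decoder step.
    peaks = [max(range(n_encoder), key=row.__getitem__) for row in alignment]
    # Mean of the largest weight per decoder step: near 1.0 means sharp attention.
    focus = sum(max(row) for row in alignment) / len(alignment)
    # Fraction of encoder positions that are the peak for at least one step.
    coverage = len(set(peaks)) / n_encoder
    # Fraction of consecutive decoder steps whose peak does not move backward.
    steps = list(zip(peaks, peaks[1:]))
    monotonic = sum(b >= a for a, b in steps) / len(steps) if steps else 1.0
    return {"focus": focus, "coverage": coverage, "monotonic": monotonic}
```

The three numbers should be read together, since attention stuck on a single input position still scores as fully monotonic while its coverage stays low.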

step-34000-align

eval-34000_ref-randomweight-align

syang1993 commented 5 years ago

@wanshun123 Hi, I cannot open the data link to check the quality of the data. I have tried several different datasets before and the model worked on them.

Also, the attention mechanism used in this repo is a very basic one, which does not work well for generating long sentences.

iamanigeeit commented 3 years ago

@wanshun123 Did you train using use_gst=False? I have the same issue when use_gst=False, but not when use_gst=True.

@syang1993 In my case the audio seems intelligible, although the quality is not good. I am using the Emotional Speech Dataset from https://hltsingapore.github.io/ESD/download.html

The English data shows a similar attention "collapse". The Chinese data is OK.

step-190000-align