keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Using the input from a training step for evaluation, but the result is bad #42

Open tuong-olli opened 7 years ago

tuong-olli commented 7 years ago

I used the input from train step 129000 for evaluation. step-129000-audio.wav sounds good, but eval-129000-audio.wav is very bad.

keithito commented 7 years ago

Can you provide more details about your training data and hyperparameters?

tuong-olli commented 7 years ago

I used your default parameters, but my training data is non-English (Vietnamese), so I used basic_cleaners and added our own characters in symbols.py. Then at step 131k, I used the same text as the audio generated during training as input to eval.py, but there is a huge difference between the two outputs: 2outputs.zip
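(Editor's note: extending the symbol set for a new language typically means editing the character list in symbols.py. Below is a minimal sketch assuming the file's general layout; the exact contents in this repo may differ, and the Vietnamese letters shown are only a partial, illustrative set.)

```python
# symbols.py -- illustrative sketch only; the actual file in this repo may differ.
# Defines the set of input symbols the model can see; every character that can
# appear in the transcripts after cleaning must be listed here.

_pad = '_'
_eos = '~'

# Base Latin characters and punctuation (as used with basic_cleaners).
_characters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!\'(),-.:;? '

# Partial, illustrative set of extra Vietnamese letters (lowercase only here;
# a real list would also need the uppercase forms and remaining tone marks).
_vietnamese = 'àáảãạăằắẳẵặâầấẩẫậđèéẻẽẹêềếểễệìíỉĩịòóỏõọôồốổỗộơờớởỡợùúủũụưừứửữựỳýỷỹỵ'

# Export all symbols; the model's embedding table is sized to len(symbols).
symbols = [_pad, _eos] + list(_characters) + list(_vietnamese)
```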

keithito commented 6 years ago

This is probably okay. Training uses teacher forcing to feed the previous step's ground truth output as input when predicting the next output, while at eval time, it's feeding the previous output generated by the model (as there is no ground truth). As a result, it's expected for the outputs during training to be of better quality.

It sounds like it's starting to learn to generate reasonable output, so you may want to keep training for a while. Or you can try to improve your training data. How much training data do you have, and how many symbols are you using in symbols.py?
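(Editor's note: to make the teacher-forcing distinction concrete, here is a minimal, framework-agnostic sketch of an autoregressive decoder loop. It is not the actual decoder in this repo; `decoder_step` is a hypothetical stand-in for the attention/decoder cell.)

```python
import numpy as np

def decode(decoder_step, initial_frame, num_steps, target_frames=None):
    """Run an autoregressive decoder loop.

    decoder_step: callable mapping the previous frame to the next predicted frame
                  (hypothetical stand-in for the attention/RNN decoder cell).
    target_frames: ground-truth frames; if given, they are fed back in place of
                   the model's own predictions (teacher forcing, as in training).
    """
    outputs = []
    prev = initial_frame
    for t in range(num_steps):
        pred = decoder_step(prev)
        outputs.append(pred)
        if target_frames is not None:
            prev = target_frames[t]   # training: feed ground truth (teacher forcing)
        else:
            prev = pred               # eval: feed the model's own output, so errors compound
    return np.stack(outputs)
```

In this picture, step-*-audio.wav corresponds to the teacher-forced path and eval-*-audio.wav to the free-running path, which is why the eval output degrades first when the model is undertrained or overfit.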

tuong-olli commented 6 years ago

The model starts to overfit at 160k steps, and I can only hear 2-3 words in the eval results. I use my own training data (1.59 hours, 2000 sentences), and there are 195 symbols in my language (Vietnamese). I don't think I'm missing any symbols, since I've already checked everything in my input. It would be nice if you could give me some advice. Thank you in advance!

keithito commented 6 years ago

It's possible that 1.59 hours is not enough training data.

ttslr commented 6 years ago

@keithito Excuse me, how many hours of training data should we prepare? Thank you very much! My eval results are also poorer than my training results.

MXGray commented 6 years ago

@tuong-olli @keithito @imucsliurui

I've tested a few small (45 minutes or so) and large (100+ hours) English and non-English datasets, and from my results, I'm setting my minimum at 50 hours. Also:

I've tried augmenting my data following my research materials (here's one >> http://speak.clsp.jhu.edu/uploads/publications/papers/1050_pdf.pdf ), and I recently augmented a 1.25-hour private, non-English dataset to 75 hours using PyDub (very minor frame/speed/pitch changes). Results so far are good. :)

P.S. Don't overdo it if you want to augment your dataset this way. A PyDub octave change of -0.03 to 0.01, on average, is what I'm setting as my limit, though that's for our own private non-English datasets (I'm sure it'll be different for each dataset).
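(Editor's note: MXGray doesn't post his exact script. Below is a minimal sketch of the kind of small pitch/speed perturbation PyDub allows, using the common frame-rate resampling trick; the octave range mirrors the -0.03 to 0.01 limit mentioned above, and the file paths are hypothetical.)

```python
import random
from pydub import AudioSegment

def pitch_shift(sound, octaves):
    """Shift pitch (and speed slightly) by resampling; keep octave values tiny."""
    new_rate = int(sound.frame_rate * (2.0 ** octaves))
    # _spawn keeps the raw samples but reinterprets them at a new frame rate,
    # which raises/lowers pitch; then resample back to the original rate.
    shifted = sound._spawn(sound.raw_data, overrides={'frame_rate': new_rate})
    return shifted.set_frame_rate(sound.frame_rate)

# Hypothetical usage: create one slightly perturbed copy of a clip.
clip = AudioSegment.from_wav('wavs/utt0001.wav')       # hypothetical path
octaves = random.uniform(-0.03, 0.01)                   # keep the change very small
pitch_shift(clip, octaves).export('wavs_aug/utt0001.wav', format='wav')
```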

ttslr commented 6 years ago

@MXGray Thank you very much! Can you tell me what language your non-English dataset is in? Can you share some test examples with us?

Thanks again!

MXGray commented 6 years ago

@imucsliurui Tagalog (Filipino) and certain local dialects (Ilokano, Bisaya, Cebuano). Here's a video (Tagalog) >> https://www.youtube.com/watch?v=1rrEwns3jgw?rel=0

begeekmyfriend commented 6 years ago

@MXGray Would you please share your approach to augmenting an audio dataset by pitch change? I have tried some of the methods mentioned in this issue on part of my own dataset (from <10h to 50h+), but the alignment did not appear. By the way, the original whole dataset (30h+) produces good alignment.

rafaelvalle commented 6 years ago

@begeekmyfriend can you post images of alignment and predicted output during training and at different iterations?

begeekmyfriend commented 6 years ago

@rafaelvalle Here is the alignment image: align-21000.zip. The output evaluation wave files are all noise; I'm afraid we cannot find any clues in them. The dataset contains the original THCHS30 audio plus pitch-changed audio, and the total length exceeds 50h.

rafaelvalle commented 6 years ago

@begeekmyfriend The dataset used by @keithito in this repo has 24h of audio without augmentation and produces good results, which is evidence of how much data one needs. I see diagonal-like lines starting to appear in your latest alignments. Can you try training for twice as long, i.e. 50k iterations, or try training without the augmented data if you have ~24h of non-augmented data? Last, can you confirm that your encoding code is correct?
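(Editor's note: one quick way to check the "encoding code is correct" part is to verify that every character in your transcripts is covered by symbols.py. A hypothetical sketch follows; the import path and metadata format are assumptions, and ideally you would run your text cleaners on each line before checking.)

```python
# Sanity check: list any transcript characters missing from symbols.py.
from text.symbols import symbols   # assumed module path; adjust to your layout

symbol_set = set(symbols)
missing = set()
with open('metadata.csv', encoding='utf-8') as f:        # hypothetical path
    for line in f:
        text = line.rstrip('\n').split('|')[-1]           # assumes id|...|text layout
        missing |= {ch for ch in text if ch not in symbol_set}
print('characters not in symbols.py:', sorted(missing))
```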

rafaelvalle commented 6 years ago

@begeekmyfriend As a side question, how long does each iteration take you, in seconds?

rafaelvalle commented 6 years ago

@MXGray @tuong-olli @imucsliurui Let's remember that @keithito produces good results with a ~24h dataset, the LJ Speech dataset...

begeekmyfriend commented 6 years ago

@rafaelvalle About 1.8 seconds per iteration on an NVIDIA GTX 1080 Ti. The original dataset without augmentation shows good alignment; I have posted it at #118. By the way, in my several attempts the alignment typically appears at 15~20K steps, so I'm fairly sure that if no alignment appears before 20K steps, we can terminate the training. Someone who tried part of THCHS30 (~26h) and got a very obscure alignment has also posted here, so I'm afraid there is no single absolute data length that works for every dataset. I have also tried simply duplicating the original dataset, but that didn't work. So I just wonder what the approach to audio augmentation should be.

rafaelvalle commented 6 years ago

@begeekmyfriend From what I understand, you were able to train using THCHS30 without augmentation but not with augmentation, and you want to know what the approach to augmentation should be. I also think you believe that one should stop training if the alignment doesn't appear before 20k iterations, regardless of the dataset and its size. Let me know if these things are correct...

begeekmyfriend commented 6 years ago

Hey, just this morning I tried merging two different Chinese Mandarin datasets (20h of THCHS30 plus my own 10h dataset). The alignment appears at 17K steps. It seems the augmentation does not work here.
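(Editor's note: for anyone wanting to reproduce the merge, here is a hypothetical sketch of combining two LJSpeech-style corpora into one training folder. It assumes an "id|transcript|..." metadata.csv layout and wavs/ subfolders; all paths and prefixes are made up, so adapt it to whatever your preprocessing script expects.)

```python
import os
import shutil

def merge(corpora, out_dir):
    """Copy wavs and concatenate metadata from several corpora, prefixing ids."""
    os.makedirs(os.path.join(out_dir, 'wavs'), exist_ok=True)
    with open(os.path.join(out_dir, 'metadata.csv'), 'w', encoding='utf-8') as out_meta:
        for prefix, corpus_dir in corpora:
            with open(os.path.join(corpus_dir, 'metadata.csv'), encoding='utf-8') as f:
                for line in f:
                    utt_id, rest = line.rstrip('\n').split('|', 1)
                    new_id = '%s_%s' % (prefix, utt_id)   # avoid id collisions
                    shutil.copy(os.path.join(corpus_dir, 'wavs', utt_id + '.wav'),
                                os.path.join(out_dir, 'wavs', new_id + '.wav'))
                    out_meta.write('%s|%s\n' % (new_id, rest))

# Hypothetical usage with made-up paths:
merge([('thchs30', 'datasets/thchs30_lj'),
       ('own', 'datasets/my_mandarin_lj')],
      'datasets/merged')
```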

tuong-olli commented 6 years ago

Does your metadata.csv file include all the languages: Tagalog, Ilokano, Bisaya, and Cebuano?
