keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Using the input from a training step for evaluation, but the result is bad #42

Open tuong-olli opened 7 years ago

tuong-olli commented 7 years ago

I used the input from train step 129000 for evaluation. step-129000-audio.wav sounds good, but eval-129000-audio.wav is very bad.

keithito commented 7 years ago

Can you provide more details about your training data and hyperparameters?

tuong-olli commented 7 years ago

I used your default parameters, but my training data is non-English (Vietnamese), so I used basic_cleaners and added our own characters in symbols.py. Then at step 131k, I used the same text as the audio generated during training as input to eval.py, but there is a huge difference between the two outputs: 2outputs.zip
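(Editor's note: extending the symbol set for a new language typically means editing the character list in symbols.py. Below is a minimal sketch assuming the file's general layout; the exact contents in this repo may differ, and the Vietnamese letters shown are only a partial, illustrative set.)

```python
# symbols.py -- illustrative sketch only; the actual file in this repo may differ.
# Defines the set of input symbols the model can see; every character that can
# appear in the transcripts after cleaning must be listed here.

_pad = '_'
_eos = '~'

# Base Latin characters and punctuation (as used with basic_cleaners).
_characters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!\'(),-.:;? '

# Partial, illustrative set of extra Vietnamese letters (lowercase only here;
# a real list would also need the uppercase forms and remaining tone marks).
_vietnamese = 'àáảãạăằắẳẵặâầấẩẫậđèéẻẽẹêềếểễệìíỉĩịòóỏõọôồốổỗộơờớởỡợùúủũụưừứửữựỳýỷỹỵ'

# Export all symbols; the model's embedding table is sized to len(symbols).
symbols = [_pad, _eos] + list(_characters) + list(_vietnamese)
```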

keithito commented 6 years ago

This is probably okay. Training uses teacher forcing to feed the previous step's ground truth output as input when predicting the next output, while at eval time, it's feeding the previous output generated by the model (as there is no ground truth). As a result, it's expected for the outputs during training to be of better quality.

It sounds like it's starting to learn to generate reasonable output, so you may want to keep training for a while. Or you can try to improve your training data. How much training data do you have, and how many symbols are you using in symbols.py?
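(Editor's note: to make the teacher-forcing distinction concrete, here is a minimal, framework-agnostic sketch of an autoregressive decoder loop. It is not the actual decoder in this repo; `decoder_step` is a hypothetical stand-in for the attention/decoder cell.)

```python
import numpy as np

def decode(decoder_step, initial_frame, num_steps, target_frames=None):
    """Run an autoregressive decoder loop.

    decoder_step: callable mapping the previous frame to the next predicted frame
                  (hypothetical stand-in for the attention/RNN decoder cell).
    target_frames: ground-truth frames; if given, they are fed back in place of
                   the model's own predictions (teacher forcing, as in training).
    """
    outputs = []
    prev = initial_frame
    for t in range(num_steps):
        pred = decoder_step(prev)
        outputs.append(pred)
        if target_frames is not None:
            prev = target_frames[t]   # training: feed ground truth (teacher forcing)
        else:
            prev = pred               # eval: feed the model's own output, so errors compound
    return np.stack(outputs)
```

In this picture, step-*-audio.wav corresponds to the teacher-forced path and eval-*-audio.wav to the free-running path, which is why the eval output degrades first when the model is undertrained or overfit.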

tuong-olli commented 6 years ago

The model starts to overfit at 160k steps, and I can only hear 2-3 words in the eval results. I use my own training data (1.59 hours, 2000 sentences), and there are 195 symbols in my language (Vietnamese). I don't think I'm missing any symbols, since I've already checked everything in my input. It would be nice if you could give me some advice. Thank you in advance!

keithito commented 6 years ago

It's possible that 1.59 hours is not enough training data.

ttslr commented 6 years ago

@keithito Excuse me, how many hours of training data should we prepare? Thank you very much! My eval results are also poorer than my training results.

MXGray commented 6 years ago

@tuong-olli @keithito @imucsliurui

I've tested a few small (45 minutes or so) and large (100+ hours) English and non-English datasets, and from my results, I'm setting my minimum at 50 hours. Also:

I've tried augmenting my data following my research materials (here's one >> http://speak.clsp.jhu.edu/uploads/publications/papers/1050_pdf.pdf ), and I recently augmented a 1.25-hour private, non-English dataset to 75 hours using PyDub (very minor frame/speed/pitch changes). Results so far are good. :)

P.S. Don't overdo it if you want to augment your dataset this way. A PyDub octave change of -0.03 to 0.01, on average, is what I'm setting as my limit, though that's for our own private non-English datasets (I'm sure it'll be different for each dataset).
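(Editor's note: MXGray doesn't post his exact script. Below is a minimal sketch of the kind of small pitch/speed perturbation PyDub allows, using the common frame-rate resampling trick; the octave range mirrors the -0.03 to 0.01 limit mentioned above, and the file paths are hypothetical.)

```python
import random
from pydub import AudioSegment

def pitch_shift(sound, octaves):
    """Shift pitch (and speed slightly) by resampling; keep octave values tiny."""
    new_rate = int(sound.frame_rate * (2.0 ** octaves))
    # _spawn keeps the raw samples but reinterprets them at a new frame rate,
    # which raises/lowers pitch; then resample back to the original rate.
    shifted = sound._spawn(sound.raw_data, overrides={'frame_rate': new_rate})
    return shifted.set_frame_rate(sound.frame_rate)

# Hypothetical usage: create one slightly perturbed copy of a clip.
clip = AudioSegment.from_wav('wavs/utt0001.wav')       # hypothetical path
octaves = random.uniform(-0.03, 0.01)                   # keep the change very small
pitch_shift(clip, octaves).export('wavs_aug/utt0001.wav', format='wav')
```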

ttslr commented 6 years ago

@MXGray Thank you very much! Can you tell me what language your non-English dataset is in? Can you share some test examples with us?

Thanks again!

MXGray commented 6 years ago

@imucsliurui Tagalog (Filipino) and certain local dialects (Ilokano, Bisaya, Cebuano). Here's a video (Tagalog) >> https://www.youtube.com/watch?v=1rrEwns3jgw?rel=0

begeekmyfriend commented 6 years ago

@MXGray Would you please share your approach to augmenting an audio dataset by pitch change? I have tried some of the methods mentioned in this issue on part of my own dataset (from <10h to 50h+), but the alignment did not appear. By the way, the original whole dataset (30h+) produces good alignment.

rafaelvalle commented 6 years ago

@begeekmyfriend can you post images of alignment and predicted output during training and at different iterations?

begeekmyfriend commented 6 years ago

@rafaelvalle Here is the alignment image: align-21000.zip. The output evaluation wave files are all noise; I'm afraid we cannot find any clues in them. The dataset contains the original THCHS30 audio plus pitch-changed audio, and the total length exceeds 50h.

rafaelvalle commented 6 years ago

@begeekmyfriend The dataset used by @keithito in this repo has 24h of audio without augmentation and produces good results, which is evidence of how much data one needs. I see diagonal-like lines starting to appear in your latest alignments. Can you try training for twice as long, i.e. 50k iterations, or try training without the augmented data if you have ~24h of non-augmented data? Last, can you confirm that your encoding code is correct?
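(Editor's note: one quick way to check the "encoding code is correct" part is to verify that every character in your transcripts is covered by symbols.py. A hypothetical sketch follows; the import path and metadata format are assumptions, and ideally you would run your text cleaners on each line before checking.)

```python
# Sanity check: list any transcript characters missing from symbols.py.
from text.symbols import symbols   # assumed module path; adjust to your layout

symbol_set = set(symbols)
missing = set()
with open('metadata.csv', encoding='utf-8') as f:        # hypothetical path
    for line in f:
        text = line.rstrip('\n').split('|')[-1]           # assumes id|...|text layout
        missing |= {ch for ch in text if ch not in symbol_set}
print('characters not in symbols.py:', sorted(missing))
```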

rafaelvalle commented 6 years ago

@begeekmyfriend As a side question, how long does each iteration take you, in seconds?

rafaelvalle commented 6 years ago

@MXGray @tuong-olli @imucsliurui Let's remember that @keithito produces good results with a ~24h dataset, the LJ Speech dataset...

begeekmyfriend commented 6 years ago

@rafaelvalle About 1.8 seconds per iteration on an NVIDIA GTX 1080 Ti. The original dataset without augmentation shows good alignment; I have posted it at #118. By the way, in my several attempts the alignment typically appears at 15~20K steps, so I'm fairly sure that if no alignment appears before 20K steps, we can terminate the training. Someone who tried part of THCHS30 (~26h) and got a very obscure alignment has also posted here, so I'm afraid there is no single absolute data length that works for every dataset. I have also tried simply duplicating the original dataset, but that didn't work. So I just wonder what the approach to audio augmentation should be.

rafaelvalle commented 6 years ago

@begeekmyfriend From what I understand, you were able to train using THCHS30 without augmentation but not with augmentation, and you want to know what the approach to augmentation should be. I also think you believe that one should stop training if the alignment doesn't appear before 20k iterations, regardless of the dataset and its size. Let me know if these things are correct...

begeekmyfriend commented 6 years ago

Hey, just this morning I tried merging two different Chinese Mandarin datasets (20h of THCHS30 plus my own 10h dataset). The alignment appears at 17K steps. It seems the augmentation does not work here.
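(Editor's note: for anyone wanting to reproduce the merge, here is a hypothetical sketch of combining two LJSpeech-style corpora into one training folder. It assumes an "id|transcript|..." metadata.csv layout and wavs/ subfolders; all paths and prefixes are made up, so adapt it to whatever your preprocessing script expects.)

```python
import os
import shutil

def merge(corpora, out_dir):
    """Copy wavs and concatenate metadata from several corpora, prefixing ids."""
    os.makedirs(os.path.join(out_dir, 'wavs'), exist_ok=True)
    with open(os.path.join(out_dir, 'metadata.csv'), 'w', encoding='utf-8') as out_meta:
        for prefix, corpus_dir in corpora:
            with open(os.path.join(corpus_dir, 'metadata.csv'), encoding='utf-8') as f:
                for line in f:
                    utt_id, rest = line.rstrip('\n').split('|', 1)
                    new_id = '%s_%s' % (prefix, utt_id)   # avoid id collisions
                    shutil.copy(os.path.join(corpus_dir, 'wavs', utt_id + '.wav'),
                                os.path.join(out_dir, 'wavs', new_id + '.wav'))
                    out_meta.write('%s|%s\n' % (new_id, rest))

# Hypothetical usage with made-up paths:
merge([('thchs30', 'datasets/thchs30_lj'),
       ('own', 'datasets/my_mandarin_lj')],
      'datasets/merged')
```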

tuong-olli commented 6 years ago

Does your metadata.csv file include all the languages: Tagalog, Ilokano, Bisaya, and Cebuano?
