toannhu opened this issue 6 years ago
You can try a larger dataset. I think you need at least about 16 hours of data with a single speaker, or more with multiple speakers.
@wotulong Can I ask you something?
Well, I have two corpora, one with a female voice and one with a male voice (uploaded below). Is it possible to train this repo with multiple speakers? Alignment is harder with multiple speakers than with a single speaker, isn't it? And if it is possible, must both corpora be the same gender, or is one male voice and one female voice OK?
The second thing I want to ask is the difference between training on a 22.05 kHz dataset and a 16 kHz one. Does the sample rate affect alignment speed?
Anyway, I just read in Kyubyong's repo that >= 5 hours is OK.
Thanks a lot for your help!
I think 16 kHz audio may converge faster, but you have to test it. In my experience, the more clearly every word is spoken in the dataset, the less data you need to get a good result, and I think 5 hours is not enough.
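If you want to try 16 kHz, this is roughly the resampling step I mean — a minimal librosa-based sketch. The directory names are placeholders, and you would also need to update whatever sample-rate setting hparams.py uses:

```python
import os
import glob
import librosa
import soundfile as sf

SRC_DIR = "wavs_22050"   # placeholder: original 22.05 kHz corpus
DST_DIR = "wavs_16000"   # placeholder: resampled output
TARGET_SR = 16000

os.makedirs(DST_DIR, exist_ok=True)
for path in glob.glob(os.path.join(SRC_DIR, "*.wav")):
    # librosa resamples on load when an explicit sr is given
    audio, _ = librosa.load(path, sr=TARGET_SR)
    sf.write(os.path.join(DST_DIR, os.path.basename(path)), audio, TARGET_SR)
```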
I've tried to train with my own non-English dataset (~3 hours; each wav is 5 to 8 seconds long), but the alignment looks very strange.
In the preprocessing step I used `basic_cleaners` and already trimmed leading and trailing silence. The audio samples don't have any background noise, and I didn't change anything in hparams.py except `max_iters=300`. The synthesized sound is hardly recognizable (see the trimming sketch below).
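For reference, here is roughly how I did the trimming — a minimal sketch with librosa; `top_db=30` is just the threshold I chose, not something from this repo, and the paths are placeholders:

```python
import os
import librosa
import soundfile as sf

def trim_silence(in_path, out_path, top_db=30):
    # Trim leading/trailing audio quieter than top_db below the peak.
    audio, sr = librosa.load(in_path, sr=None)  # keep the original sample rate
    trimmed, _ = librosa.effects.trim(audio, top_db=top_db)
    sf.write(out_path, trimmed, sr)

os.makedirs("trimmed", exist_ok=True)
trim_silence("raw/0001.wav", "trimmed/0001.wav")  # placeholder paths
```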
sound.zip
Here is my training log. I really appreciate your help.
train.log
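For anyone checking the log: this is the quick script I use to eyeball an attention alignment, assuming it has been saved as a NumPy array. The path below is hypothetical — this repo may dump alignments as images instead:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical path: adjust to wherever your training run saves alignments.
alignment = np.load("logs/alignment_step_10000.npy")  # shape: (encoder_steps, decoder_steps)

plt.imshow(alignment, aspect="auto", origin="lower", interpolation="none")
plt.xlabel("Decoder timestep")
plt.ylabel("Encoder timestep")
plt.title("Attention alignment (roughly diagonal = good)")
plt.colorbar()
plt.savefig("alignment.png")
```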