keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Strange Alignment #127

Open toannhu opened 6 years ago

toannhu commented 6 years ago

I've tried to train with my own non-English dataset (~3 hours; each wav is 5 to 8 seconds long), but the alignment looks very strange.

step-50000-align

step-57000-align

step-65000-align

In the preprocessing step, I used basic_cleaners and trimmed leading and trailing silence. The audio samples don't have any background noise, and I didn't change anything in hparams.py except max_iters=300.
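For reference, the silence trimming described above can be sketched with a simple energy threshold in pure NumPy. This is a hypothetical helper for illustration, not this repo's actual preprocessing code; the threshold_db value is an assumption you would tune per dataset.

```python
import numpy as np

def trim_silence(wav, threshold_db=-40.0):
    """Strip leading/trailing samples quieter than threshold_db
    relative to the peak amplitude (illustrative sketch only)."""
    threshold = np.max(np.abs(wav)) * 10.0 ** (threshold_db / 20.0)
    voiced = np.where(np.abs(wav) > threshold)[0]
    if len(voiced) == 0:
        return wav  # all silence: nothing to keep
    return wav[voiced[0]:voiced[-1] + 1]

# Synthetic check: half a second of silence around a 1 s, 440 Hz tone
sr = 22050
silence = np.zeros(sr // 2, dtype=np.float32)
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
wav = np.concatenate([silence, tone, silence])
trimmed = trim_silence(wav)
```

After trimming, `trimmed` should be roughly one second long (the tone), with the surrounding silence removed.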

The synthesized audio is hardly recognizable.

sound.zip

Here is my training log. I'd really appreciate your help.

train.log

liyz15 commented 6 years ago

You can try a larger dataset.

wotulong commented 6 years ago

I think you need at least about 16 hours of data for a single speaker, or more for multiple speakers.

toannhu commented 6 years ago

@wotulong Can I ask you something?

Well, I have two corpora, one with a female voice and one with a male voice (uploaded below). Is it possible to train this repo with multiple speakers? It's harder to learn alignment with multiple speakers than with a single speaker, isn't it? And if it is possible, must both corpora be the same gender, or is one male voice plus one female voice OK?

The second thing I want to ask is the difference between training on a 22.05kHz dataset and a 16kHz one. Does it affect how quickly alignment is learned?

voice.zip
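One thing worth noting about the sample-rate question: since the frame shift is specified in milliseconds in hparams.py, the number of spectrogram frames per second of audio comes out the same at either rate; what changes is the hop size in samples (and the per-frame FFT cost). A back-of-envelope sketch, assuming the default frame_shift_ms = 12.5:

```python
def hop_and_frames(sample_rate, frame_shift_ms=12.5, seconds=1.0):
    """Hop size in samples, and spectrogram frames per `seconds` of
    audio, given a millisecond-based frame shift (assumed default)."""
    hop = int(sample_rate * frame_shift_ms / 1000)
    frames = int(seconds * sample_rate) // hop
    return hop, frames

# 16 kHz vs 22.05 kHz: different hop in samples, same frames per second
print(hop_and_frames(16000))   # hop = 200 samples, 80 frames/s
print(hop_and_frames(22050))   # hop = 275 samples, 80 frames/s
```

So the decoder has the same number of steps to align either way; any convergence difference would come from spectral resolution and data quality rather than sequence length.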

Anyway, I just read this from Kyuubong, who said that >= 5 hours is OK.


Thanks a lot for your help!

wotulong commented 6 years ago

I think 16kHz audio may converge faster, but you have to test it. In my experience, the more clearly every word is spoken in the dataset, the less data you need to get a good result, and I think 5 hours is not enough.