keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Most synthesized wavs are garbage #304

Open nikigre opened 4 years ago

nikigre commented 4 years ago

Hi! I have now upgraded to a much more powerful computer, and I am currently at step 23300. The wavs at checkpoints are very good and easy to understand, but if I put almost anything into the demo server, it is not understandable. Also, most of the wav files come out at 10:25. Why is that? And if I put in the same sentence I recorded, it works, but it adds weird noise at the end. Thank you for your help!

japita-se commented 4 years ago

Show the attention plots, please. And remember that for unseen labels you have to reach at least 300k steps.

nikigre commented 4 years ago

Hi! Here are the last 4 images and sounds: Plot.zip. I am synthesizing the Slovenian language. Yes, but I made up a sentence that is built from the same words I know I have recorded, and I almost always get unrecognisable sound. Thank you!

el-tocino commented 4 years ago

Your charts show it has not yet aligned.
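For reference, the training script already writes these alignment images (via util/plot.py); a minimal standalone sketch for rendering one yourself, assuming the alignment is a NumPy array of shape (encoder_steps, decoder_steps), might look like this:

```python
# Minimal sketch: render an attention alignment matrix to a PNG.
# `alignment` is assumed to be a NumPy array of shape
# (encoder_steps, decoder_steps), like the ones saved during training.
import matplotlib
matplotlib.use("Agg")  # render to file without a display
import matplotlib.pyplot as plt

def plot_alignment(alignment, path="alignment.png"):
    fig, ax = plt.subplots()
    im = ax.imshow(alignment, aspect="auto", origin="lower", interpolation="none")
    fig.colorbar(im, ax=ax)
    ax.set_xlabel("Decoder timestep")
    ax.set_ylabel("Encoder timestep")
    fig.savefig(path)
    plt.close(fig)
```

A healthy run shows a sharp, roughly diagonal band (each decoder step attending to a steadily advancing encoder position); a diffuse or blank plot means the model has not aligned yet.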

nikigre commented 4 years ago

Hi! What is the minimum amount of recordings (in hours) for acceptable results? Which language are you working in, @el-tocino? Thank you

el-tocino commented 4 years ago

I use English. I started with about 1000 recordings, which didn't work well, and have since moved up to 6000.

nikigre commented 4 years ago

Hi @el-tocino! How long are these recordings in total (minutes/hours)?

el-tocino commented 4 years ago

Average clip length was 3.3s; the shortest was 0.5s, the longest 9.6s. The total is now up to several hours; I'm not sure exactly without counting, but at least 6.
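If you want the same statistics for your own dataset, a short stdlib-only sketch (assuming a directory of `wavs/*.wav` clips, as in an LJSpeech-style layout) could be:

```python
# Sketch: clip-length statistics for a directory of .wav files.
# The "wavs/" path is illustrative; point it at your own dataset.
import glob
import wave

durations = []
for path in glob.glob("wavs/*.wav"):
    with wave.open(path, "rb") as w:
        durations.append(w.getnframes() / w.getframerate())

durations.sort()
total = sum(durations)
print(f"clips:    {len(durations)}")
print(f"total:    {total / 3600:.2f} h")
print(f"mean:     {total / len(durations):.2f} s")
print(f"shortest: {durations[0]:.2f} s, longest: {durations[-1]:.2f} s")
```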

nikigre commented 4 years ago

Thank you @el-tocino! How about hparams.py? Did you change anything?

el-tocino commented 4 years ago

Yep. I adjusted outputs per step and batch size to fit my GPU, set the sample rate to match my clips (16000), and adjusted the learning rate depending on how many samples I had.
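For anyone following along, a sketch of the kind of hparams.py edits being described, assuming the TF 1.x `tf.contrib.training.HParams` object this repo uses. The field names follow keithito/tacotron; the values are illustrative (16 kHz clips, a small GPU), not recommendations:

```python
# Illustrative excerpt of hparams.py changes; values are examples only.
import tensorflow as tf  # TF 1.x, as used by this repo

hparams = tf.contrib.training.HParams(
    cleaners="basic_cleaners",    # non-English text may need different cleaners
    sample_rate=16000,            # match the sample rate of the recordings
    outputs_per_step=5,           # decoder frames per step; affects memory use
    batch_size=16,                # reduce if the GPU runs out of memory
    initial_learning_rate=0.002,  # often tuned to dataset size
)
```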

nikigre commented 4 years ago

Hi guys! So after a break, I decided to try again. I have made new recordings that should be good, using the LJSpeech dataset as an example. I have 1.9 hours of recordings, and more are being made. I have run the training process and am currently at step 37150, but the alignment graph is still empty. What am I doing wrong? I have no idea; I am a bit desperate here. step-37150-audio.wav sounds a bit robotic, but it is understandable. The demo server, however, does not synthesise anything. Here are my hparams.txt

[attached alignment plot: step-37150-align]

Thank you for your help!

nikigre commented 4 years ago

Hi! Does anyone have any suggestions? Thank you!

nikigre commented 4 years ago

Hi! @el-tocino do you have any suggestions?

el-tocino commented 4 years ago

You're not aligning, probably due to bad data. Look up nmstoker's dataset analysis tool in the Mozilla TTS repo to see how your dataset maps out, and maybe try that repo instead as well.
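In the spirit of that kind of dataset analysis (this is a rough sketch, not nmstoker's tool), one quick sanity check is flagging clips whose characters-per-second rate is an outlier, which often indicates a mismatched transcript or a bad recording. Assuming an LJSpeech-style layout (`metadata.csv` with `id|text` lines, clips in `wavs/<id>.wav`):

```python
# Rough sketch: flag clips whose chars/sec rate is far from the mean.
# Assumes LJSpeech-style metadata.csv and wavs/ layout (illustrative paths).
import wave

rates = {}
with open("metadata.csv", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("|")
        clip_id, text = parts[0], parts[-1]
        with wave.open(f"wavs/{clip_id}.wav", "rb") as w:
            seconds = w.getnframes() / w.getframerate()
        rates[clip_id] = len(text) / seconds

mean = sum(rates.values()) / len(rates)
for clip_id, cps in sorted(rates.items(), key=lambda kv: kv[1]):
    if cps < 0.5 * mean or cps > 1.5 * mean:
        print(f"check {clip_id}: {cps:.1f} chars/sec (mean {mean:.1f})")
```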