r9y9 / deepvoice3_pytorch

PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models
https://r9y9.github.io/deepvoice3_pytorch/

The result obtained by eval_model or synthesis is much worse than the one obtained during training #201

Open Eleanor456 opened 4 years ago

Eleanor456 commented 4 years ago

When I generated audio from the checkpoint at 32000 steps, the output was pure noise, and the alignment pictures are always empty, as shown below. How can I get results close to the normal-sounding audio obtained during training?

[Image: step000034000_text1_multispeaker10_alignment]

marianbasti commented 4 years ago

What datasets and presets are you using?

Eleanor456 commented 4 years ago

> What datasets and presets are you using?

A Chinese dataset with 61 speakers, and a preset that I modified from deepvoice3_vctk.json.

marianbasti commented 4 years ago

Which frontend did you select? I'm trying to train on Spanish speakers and the results are a little gibberish, but not noise.

Eleanor456 commented 4 years ago

> Which frontend did you select? I'm trying to train on Spanish speakers and the results are a little gibberish, but not noise.

I converted the transcripts to pinyin, so I selected the en frontend. I think the bad result may be because the model has not trained for enough epochs.

marianbasti commented 4 years ago

It shouldn't be so noisy. This is what I get with 40000 steps on a 13-speaker dataset: [Image: step000040000_text3_multispeaker10_alignment]

I'm using the es frontend, so there is no phonetic dictionary.

Eleanor456 commented 4 years ago

> It shouldn't be so noisy. This is what I get with 40000 steps on a 13-speaker dataset: [Image: step000040000_text3_multispeaker10_alignment]
>
> I'm using the es frontend, so there is no phonetic dictionary.

This is the result after training for 61000 steps with a batch size of 64: [Image]

It is slightly better than before, so I plan to continue training and observe the result.
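
For anyone comparing runs, here is a rough way to turn a step count and batch size into an approximate number of epochs. It is only a sketch: it assumes every step consumes one full batch, and `NUM_UTTERANCES` is a hypothetical placeholder, since the dataset size is not given in this thread.

```python
def epochs_seen(steps: int, batch_size: int, num_utterances: int) -> float:
    """Approximate number of passes over the training set.

    Assumes each step processes exactly `batch_size` utterances
    (no dropped or partial batches).
    """
    if num_utterances <= 0:
        raise ValueError("num_utterances must be positive")
    return steps * batch_size / num_utterances


# Hypothetical dataset size, NOT taken from this thread; replace with your own.
NUM_UTTERANCES = 20_000

# 61000 steps at batch size 64, the run described above.
print(f"{epochs_seen(61_000, 64, NUM_UTTERANCES):.1f} epochs")
```

With a real utterance count plugged in, the same function makes it easier to compare runs that use different batch sizes, since step counts alone are not comparable.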

marianbasti commented 4 years ago

Please let me know how well it goes with that batch size.

JohnHerry commented 3 years ago

I have the same problem. I am using the MAGICDATA dataset with 1016 speakers. Training for 1,500,000 to 2,000,000 steps gave good results during the training process, but inference with these two models produced bad speech. @Eleanor456 Is your model working well now?