The result obtained by eval_model or synthesis is much worse than the one obtained during training

r9y9 / deepvoice3_pytorch

PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models

https://r9y9.github.io/deepvoice3_pytorch/

1.97k stars, 485 forks

The result obtained by eval_model or synthesis is much worse than the one obtained during training #201

Open Eleanor456 opened 4 years ago

Eleanor456 commented 4 years ago

When I generated audio from the checkpoint at 32000 steps, the output was pure noise, and the alignment plots are always empty, as shown below. How can I get results close to the normal-sounding audio obtained during training?

step000034000_text1_multispeaker10_alignment

marianbasti commented 4 years ago

What datasets and presets are you using?

Eleanor456 commented 4 years ago

> What datasets and presets are you using?

A Chinese dataset with 61 speakers, and a preset I modified based on deepvoice3_vctk.json.

marianbasti commented 4 years ago

Which frontend did you select? I'm trying to train on Spanish speakers and the results are a little gibberish, but not noise.

Eleanor456 commented 4 years ago

> Which frontend did you select? I'm trying to train on Spanish speakers and the results are a little gibberish, but not noise.

I converted the transcripts to pinyin, so I selected the en frontend. I think the bad result may be because the model has not been trained for enough epochs.

marianbasti commented 4 years ago

Shouldn't be so noisy. This is what I get with 40000 steps on a 13-speaker dataset.

step000040000_text3_multispeaker10_alignment

es frontend, so no phonetics dictionary

Eleanor456 commented 4 years ago

> Shouldn't be so noisy. This is what I get with 40000 steps on a 13-speaker dataset.
>
> es frontend, so no phonetics dictionary

This is the result after training for 61000 steps with a batch size of 64.

It is slightly better than before, so I plan to continue training and observe the result.

marianbasti commented 4 years ago

Please let me know how well it goes with that batch size

JohnHerry commented 3 years ago

Same problem here. I am using the MAGICDATA dataset with 1016 speakers. Training for 1,500,000~2,000,000 steps gave good results during the training process, but inference with these two models produces bad speech. @Eleanor456, is your model good now?