wanshun123 opened this issue 5 years ago
This is a good question. As you said, the quality of synthesized speech depends a lot on the quality of your data. You will get a high-quality TTS if the data is recorded in a quiet environment (an anechoic room is recommended) by a professional voice actor. You should also keep in mind that data recording is not a one-time job; you may need to update or record new sentences based on your application. Most industry-standard TTS databases are recorded by a professional voice actor and are updated as the application evolves.
In your examples, both the Nancy and Blizzard 2013 data were recorded by voice actors, and their pronunciation is very clear. Moreover, there is no background noise in either the Nancy or the Blizzard 2013 data, whereas LJ Speech has some background noise.
One thing you could try experimenting with is the eval "power" parameter. Since it is only used at eval time, you do not need to retrain the model to change it. Increasing it gives the voice less of a robotic tinge but makes it more muffled. You might find a better sweet spot by increasing it slightly, running some eval tests, adjusting it again, testing again, and so on. The default value is pretty good, though.
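For concreteness, here is a minimal sketch of what such a sweep could look like, assuming the parameter behaves like the `power` hyperparameter in keithito/tacotron (an exponent applied to the predicted magnitude spectrogram before Griffin-Lim); the file names, STFT settings, and the sweep values below are just placeholders:

```python
# Sketch: sweep an eval-time "power" exponent over Griffin-Lim reconstruction.
# Assumes the model has already produced a linear magnitude spectrogram and
# saved it as a .npy file (hypothetical path and shapes).
import numpy as np
import librosa
import soundfile as sf

hop_length = 256
win_length = 1024
sample_rate = 22050

def synthesize(mag, power, n_iter=60):
    """Invert a predicted linear magnitude spectrogram (freq x frames)."""
    # Raising magnitudes to power > 1 sharpens spectral peaks, which tends to
    # reduce the robotic/metallic ring but can muffle the audio if pushed too far.
    return librosa.griffinlim(mag ** power, n_iter=n_iter,
                              hop_length=hop_length, win_length=win_length)

mag = np.load("predicted_spectrogram.npy")  # hypothetical model output
for power in (1.2, 1.5, 1.8):               # sweep around the default
    sf.write(f"eval_power_{power}.wav", synthesize(mag, power), sample_rate)
```

Listening to the resulting files side by side makes it easy to pick the value that trades off the robotic ring against muffling for your particular voice.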
In all my datasets there is no beginning/ending silence, ...
Is it a must?
I have done a lot of training on different self-made datasets (typically around 3 hours of audio across a few thousand .wav files, all 22050 Hz) using Tacotron, starting from the pretrained LJ Speech model (with the same hyperparameters each time and training to a similar number of steps), and I am very confused why for some datasets the output audio ends up very clear for many samples - sometimes even indistinguishable from the actual person speaking - while for other datasets the synthesised audio always has choppy aberrations. In all my datasets there is no beginning/ending silence, the transcriptions are all correct, and the datasets have fairly similar phoneme distributions.
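For concreteness, checking for leading/trailing silence and sample rate across a dataset looks roughly like this (a minimal sketch using librosa; the folder name and the 30 dB threshold are just illustrative):

```python
# Sanity-check every clip: confirm 22050 Hz and report edge silence.
import glob
import librosa

for path in sorted(glob.glob("my_dataset/wavs/*.wav")):
    y, sr = librosa.load(path, sr=None)        # keep the file's native rate
    assert sr == 22050, f"{path}: unexpected sample rate {sr}"
    _, (start, end) = librosa.effects.trim(y, top_db=30)
    lead = start / sr
    tail = (len(y) - end) / sr
    if lead > 0.1 or tail > 0.1:               # flag >100 ms of edge silence
        print(f"{path}: {lead:.2f}s leading / {tail:.2f}s trailing silence")
```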
To take an example from publicly available datasets: on https://keithito.github.io/audio-samples/ one can hear that the model trained on the Nancy Corpus sounds significantly less robotic and clearer than the model trained on LJ Speech. At https://syang1993.github.io/gst-tacotron/ there are samples from a Tacotron model trained on Blizzard 2013 with extremely good quality compared to any samples I've heard from a Tacotron model trained on LJ Speech, even though the Blizzard 2013 dataset used there is smaller than LJ Speech. Why might this be?
Any comments appreciated.