High fidelity? I suggest you try NVIDIA's open-source WaveGlow.
Thanks for the suggestion! I don't have the hardware to run WaveGlow, but it led me to another project which I think will do what I want: https://github.com/twidddj/tf-wavenet_vocoder It seems you train it on your dataset, and then you can give it audio files generated by Tacotron and it will improve that audio. It has some examples showing exactly that with the LJSpeech dataset. I'll post my results here when I have time to test it.
Unfortunately that vocoder is also going to take a very long time to train and a long time to generate sentences. Instead, I decided to refine my dataset and also try a method someone mentioned: slightly editing the pitch, speed, tempo, etc. of each clip to effectively duplicate the dataset. The dataset ends up double the size, with the same sentences but slight modifications to the audio. I'm curious whether that will help with what the network learns. I'm going to close this and eventually post the best results I was able to achieve.
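For anyone curious, the augmentation idea above can be sketched in a few lines. This is a minimal numpy-only illustration, not the exact tool I used (something like sox or librosa would be the practical choice); the function name `change_speed` is mine. Resampling by a small factor changes speed and pitch together, which is one cheap way to perturb clips:

```python
import numpy as np

def change_speed(samples: np.ndarray, factor: float) -> np.ndarray:
    """Resample by linear interpolation. factor > 1 speeds the clip up
    (shorter, higher-pitched); factor < 1 slows it down."""
    n_out = int(round(len(samples) / factor))
    old_idx = np.arange(len(samples))
    new_idx = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(new_idx, old_idx, samples)

# Toy "dataset": one 1-second 440 Hz tone at 16 kHz.
clips = [np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)]

# Keep the originals and add slightly perturbed copies of each clip.
augmented = clips + [change_speed(c, f) for c in clips for f in (0.95, 1.05)]
```

Each perturbed copy keeps the same transcript, so the metadata file just gains duplicate lines pointing at the new audio files.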
I decided to try training a voice that I took from YouTube videos. I made the dataset myself and I'm still adding to it. It only has about 40 minutes of audio in 900 clips. It's a male speaking English. I then trained on top of the recent LJ-Speech model that was posted. After about 10k steps, the audio sounds pretty good! I can easily tell who the voice belongs to. The sentence structure is really great. Most words are understandable.
The only issue is that it has more of a robotic, tinny sound than some of the other examples I've heard. I know it can never fully go away, but I feel there's room for improvement. My question is: what would help get rid of that sound? Should I increase the dataset? Should I train longer? Are there any parameters I should mess with that would potentially help? I'd like to do whatever I can to further improve it.
If anyone has suggestions, let me know.