mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

make a high quality public domain training set using mozilla deepspeech and librivox (idea/enhancement) #34

Motherboard closed this issue 6 years ago

Motherboard commented 6 years ago

As I understand it, the difference between Google's model and the pretrained model available here is the quality and size of the training set.

Would it be possible to take a high quality, long LibriVox recording and use Mozilla's STT model to pinpoint the timing of each spoken word? We already have the ground-truth text from LibriVox, so it's only a matter of timing it.

We could get tens of hours of single-speaker recordings this way.

Does this make sense? How hard would it be to accomplish? I could have a go if it's not too hard; I haven't worked with DeepSpeech yet, and haven't looked at how the dataset is encoded, so I don't know how difficult or worthwhile it is.
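
A minimal sketch of the timing idea, assuming the deepspeech 0.7+ Python API (where `Model.sttWithMetadata` returns per-character tokens with start times); the model path and file names are illustrative:

```python
# Sketch: run DeepSpeech over a LibriVox recording and recover a start
# time for every recognized word. Assumes the deepspeech 0.7+ Python
# API; model path and file names are illustrative.
import wave

import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.7.4-models.pbmm")

with wave.open("librivox_chapter.wav", "rb") as w:  # expects 16 kHz mono 16-bit PCM
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

tokens = ds.sttWithMetadata(audio).transcripts[0].tokens

# Tokens are single characters; group them into (word, start_time) pairs.
words, current, start = [], "", None
for tok in tokens:
    if tok.text == " ":
        if current:
            words.append((current, start))
        current, start = "", None
    else:
        if start is None:
            start = tok.start_time  # seconds from the beginning
        current += tok.text
if current:
    words.append((current, start))

# The recognized words can then be matched against the LibriVox ground
# truth (e.g. with difflib.SequenceMatcher) to place the known text in
# time and cut the recording into clip/transcript training pairs.
for word, t0 in words[:10]:
    print(f"{t0:7.2f}s  {word}")
```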

cuuupid commented 6 years ago

I'm experimenting with a similar idea: we could also feed the raw text into another TTS model and generate lots of training data that way. This could boost the model's accuracy and coherence, and we could then condition further using WaveNet and retrain on human data.

Google's dataset is apparently around 25 hours, so we would need about that much training data (roughly 4x what exists right now).
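
One way the raw-text idea could look in practice; this is only a sketch, where `synthesize` is a hypothetical stand-in for whatever TTS model produces the audio, and the LJSpeech-style `id|transcript` metadata layout is one common choice:

```python
# Sketch: synthesize each transcript line with an existing TTS model
# and store the pairs in an LJSpeech-style layout (wavs/*.wav plus a
# metadata.csv of "id|transcript" rows).
import csv
from pathlib import Path

import soundfile as sf

from my_tts import synthesize  # hypothetical: text -> float32 waveform

SAMPLE_RATE = 22050  # assumed output rate of the hypothetical model
out = Path("synthetic_dataset")
(out / "wavs").mkdir(parents=True, exist_ok=True)

with open("transcripts.txt") as f, open(out / "metadata.csv", "w", newline="") as meta:
    writer = csv.writer(meta, delimiter="|")
    for i, line in enumerate(f):
        text = line.strip()
        if not text:
            continue
        wav = synthesize(text)
        clip_id = f"synth-{i:05d}"
        sf.write(out / "wavs" / f"{clip_id}.wav", wav, SAMPLE_RATE)
        writer.writerow([clip_id, text])
```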

erogol commented 6 years ago

@Motherboard the real difference between Google's model and TTS is WaveNet. It gives a huge boost in fidelity. It is the holy grail of TTS systems right now; nobody except Google makes it work for real-time systems. And I believe they use > 25 h of data for their deployed system, contrary to what they suggest in the paper. Trained TTS models are otherwise really weak at generalizing to unseen words, especially if they are trained on characters rather than phonemes.
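
For illustration, the phonemes-vs-characters point comes down to converting the input text to a pronunciation alphabet before training; a minimal example with the `phonemizer` package (the package choice is illustrative, and the exact output depends on the espeak version installed):

```python
# Convert raw text to phonemes so the model never has to guess the
# pronunciation of unseen words. Uses the `phonemizer` package with
# its espeak backend.
from phonemizer import phonemize

text = "The quick brown fox jumps over the lazy dog."
phonemes = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phonemes)  # e.g. an IPA string like "ðə kwɪk bɹaʊn fɑːks ..."
```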

I think it is quite a smart way to segment the data, if it works as you described. If you try this, please let me know the result. At first sight, it looks like a viable way to curate a dataset.

@pshah123 Using another TTS system is a clever way to augment data, but it might lead to licensing issues if you use the result commercially.

erogol commented 6 years ago

@Motherboard regarding the Mozilla Common Voice data, I need to do some more work here to make TTS more stable before delving into data curation. However, it is definitely in the queue.

Motherboard commented 6 years ago

Thanks for the input.

TTS is based on Tacotron, right? Google's Tacotron model (which, to the best of my knowledge, uses the Griffin-Lim vocoder, not WaveNet) sounds far superior to any public model I've heard so far. It also sounds better than public Tacotron 2 models that do use WaveNet, and r9y9's WaveNet model sounds quite good on its own, so I'm not sure the vocoder is what's holding the system back.
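
For reference, Griffin-Lim turns a magnitude spectrogram back into a waveform by iteratively estimating the missing phase; a minimal sketch with librosa's implementation follows (parameters and file names are illustrative, and in a real pipeline a model like Tacotron would supply the predicted spectrogram instead of one computed from recorded audio):

```python
# Sketch: invert a magnitude-only spectrogram back to audio with
# Griffin-Lim. Parameters and file names are illustrative.
import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("sample.wav", sr=22050)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))  # phase discarded

# Griffin-Lim iteratively re-estimates the phase discarded above.
y_hat = librosa.griffinlim(S, n_iter=60, hop_length=256)
sf.write("griffinlim_reconstruction.wav", y_hat, sr)
```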

By the way, what are the downsides of using Tacotron 2 over Tacotron? There's a BSD-licensed PyTorch implementation of Tacotron 2 in NVIDIA's git repository, and there's also r9y9's implementation. Why not move to one of these instead of stabilizing a Tacotron-based implementation?

Also, Facebook had a paper out showing that VoiceLoop (and even Char2Wav) gets a better MOS than Tacotron on the publicly available datasets...

By the way, I really like your blog :)

erogol commented 6 years ago

@Motherboard I don't remember the paper exactly, but they might be using phonemes for English instead of characters. That might explain the difference. Otherwise, I am not quite sure what else it could be. Maybe hyperparameters, slight engineering tweaks, better data, or a small bug in our model :)

I have not tried Tacotron2. However, from other papers, mostly about speaker embeddings, I see they use Tacotron over Tacotron2 for some reason.

I started to add changes towards Tacotron2 (https://github.com/mozilla/TTS/issues/26), but it is a slow process since I like to see the effect of each change on the results. So far nothing has shown a promising improvement. My feeling is that Google uses Tacotron2 because WaveNet can recover what the architectural change sacrifices, so I am skeptical that Tacotron2 is better with any other vocoder.

I have a VoiceLoop implementation as well, but it also uses phonemes and I cannot make it learn from raw characters. Since the use of phonemes is a limiting factor for moving to other languages, I'd prefer to go with Tacotron.

THX :)

erogol commented 6 years ago

No activity here, feel free to reopen