mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Mozilla Public License 2.0

Train German Model #93

Closed erogol closed 5 years ago

erogol commented 5 years ago

A dataset for research purposes can be found here: http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/

I expect better results, since German spelling maps more consistently to pronunciation.

twerkmeister commented 5 years ago

Hey Eren, the link seems to be broken; the dataset now resides here: https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/

twerkmeister commented 5 years ago

Looking into the M-AILABS dataset right now. Looks like we have five different speakers: one male and four female. Global style tokens should also be of use here to capture prosody when training the system. The reading can be quite expressive:

https://vocaroo.com/i/s1nl3RWZrGQO vs. https://vocaroo.com/i/s1xlM1WWRHx8

twerkmeister commented 5 years ago

Also numbers aren't normalized in those files 😏

erogol commented 5 years ago

@twerkmeister numbers are normalized by TTS by default, unless the input is unusual. The M-AILABS dataset looks like a good but hard starting point, since the readings have very large prosody changes within a session. That makes it hard to model with vanilla TTS, but using style tokens makes sense to capture these fluctuations.

twerkmeister commented 5 years ago

Yeah doing some tests with the phonemizer right now. Numbers and even things like "G20" work quite well. The things it messes up are definitely dates, ordinals, and some abbreviations:

echo '20.10.2019' | phonemize -l de -b espeak
tsvantsɪç pʊnkt tseːn pʊnkt tsvaɪ taʊzənt nɔøntseːn
echo 'der 60. Geburtstag' | phonemize -l de -b espeak
dɛɾ zɛçtsɪç ɡəbʊɐtstɑːk
echo '1,3 Mrd.' | phonemize -l de -b espeak
aɪns kɔma dɾaɪ ɛmɛɾdeː

But that might be negligible. Gonna run a test on the Angela Merkel subset of that dataset and see what it produces. Also gonna implement simple multi-speaker support and run it on German Common Voice.
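To get a feel for how often these cases occur in a transcript file, a small pre-check can flag them before phonemization. This is only a sketch: the function names, the regex heuristics, and the two-entry abbreviation table are assumptions made for illustration and are not part of TTS or phonemizer. The ordinal pattern in particular is a rough heuristic and will also flag a number that ends a sentence.

```python
import re

# Illustrative heuristics only; names and patterns are assumptions.
DATE_RE = re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{2,4}\b")
# A number followed by a dot, whitespace, and a capitalized word, e.g. "60. Geburtstag".
ORDINAL_RE = re.compile(r"\b\d+\.(?=\s+[A-ZÄÖÜ])")
# Hypothetical, deliberately tiny abbreviation table.
ABBREVIATIONS = {"Mrd.": "Milliarden", "Mio.": "Millionen"}


def expand_abbreviations(text):
    """Replace known abbreviations with their spoken form."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return text


def find_risky_tokens(text):
    """Return (kind, token) pairs for dates, ordinals, and known abbreviations."""
    found = [("date", m.group()) for m in DATE_RE.finditer(text)]
    # Blank out dates first so their parts are not reported again as ordinals.
    without_dates = DATE_RE.sub(" ", text)
    found += [("ordinal", m.group()) for m in ORDINAL_RE.finditer(without_dates)]
    found += [("abbreviation", a) for a in ABBREVIATIONS if a in text]
    return found


for line in ["20.10.2019", "der 60. Geburtstag", "1,3 Mrd."]:
    print(line, "->", find_risky_tokens(line))
```

Flagged lines could then be fixed by hand or dropped; expanding dates and ordinals into words automatically would need a German number-to-words step, which this sketch leaves out.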

twerkmeister commented 5 years ago

Also noticing that the automatic sentence segmentation plus alignment aren't perfect. Sometimes first and last words of a sentence slip entirely or partially into neighboring audio files

twerkmeister commented 5 years ago

Ok, training seems to be running now, but I had to set do_trim_silence to false; otherwise it would end up with a few empty audio files and crash.

erogol commented 5 years ago

You might need to set the silence threshold for do_trim..., but for many datasets it does not change things much.
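One way to see why a clip can end up empty is to trim with a simple amplitude threshold and list the clips that have nothing left. The sketch below is not the trimming code used by TTS; the function names, the plain-list sample format, and the default threshold of 0.01 are assumptions chosen for illustration.

```python
def trim_silence(samples, threshold=0.01):
    """Keep the span between the first and last sample louder than threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        # Nothing exceeds the threshold: the whole clip counts as silence.
        return []
    return samples[loud[0]:loud[-1] + 1]


def find_empty_after_trim(clips, threshold=0.01):
    """Return the names of clips that would be empty after trimming."""
    return [name for name, samples in clips.items()
            if not trim_silence(samples, threshold)]


# Toy clips with samples in [-1.0, 1.0]; real audio would be loaded from files.
clips = {
    "ok.wav": [0.0, 0.5, 0.2, 0.0],
    "too_quiet.wav": [0.0, 0.001, 0.002],
}
print(find_empty_after_trim(clips))
```

Running a check like this before training makes it possible to either lower the threshold or drop the affected files instead of disabling trimming entirely.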

erogol commented 5 years ago

> Also noticing that the automatic sentence segmentation plus alignment aren't perfect. Sometimes first and last words of a sentence slip entirely or partially into neighboring audio files

Could you give me some examples of this? It is interesting!

m-toman commented 5 years ago

I tried one of the German voices from M-AILABS some time ago and adapted a small set (600 sentences or so) using Rayhane-mamah's Tacotron implementation, and it worked pretty well.

It can be heard here https://kutinkindlinger.com/le-parleur-radiopiece-52/

The SoundCloud player is a bit tiny but it's there ;)

That's just Griffin-Lim though, but for this art project it wasn't too bad a match.

My main point is that the pronunciation is pretty good when training from characters (of course it's a lot easier for German than English).

twerkmeister commented 5 years ago

@erogol here's one example of the endings slipping into the next example. I didn't search for this one; it was among the first two I listened to. The Und of the second sentence is in the first audio file.

Next example, randomly picked: the word order got seriously mixed up:

what's being said is

Soldatinnen oder aber auch der Männer, aber vieler Kinder auch höre, [...]

Checked some more: almost always the first words, like Und, are missing, or more:

The beginning Das ist is missing (and found in the previous audio file ^^)

Merkel dataset quality doesn't seem to be great

erogol commented 5 years ago

@twerkmeister I've not seen this much confusion in alignment. Are you sure something else is not the culprit? In general, if alignment does not work, it only produces gibberish. It does not change the order of words. This is quite interesting.

So are these audios generated by TTS?

Are you sure the text given to the model is what you provided above?

Do you also have alignment plots?

If these are from TTS, how did you generate them? From the notebook, or did you write your own code?

twerkmeister commented 5 years ago

This is the training dataset :D

twerkmeister commented 5 years ago

I have some first results on the German Common Voice corpus with multi-speaker embeddings (trained with 265 speakers that had at least 100 sentences each). It's much better than what I had before, when we met. Interestingly, the training didn't go particularly well until step 200k or so, when it suddenly learned very good attention.

Here are a few synthesized examples of my favorite German tongue twister: speaker 0, speaker 1, speaker 64 (female), speaker 188 (fairly clear).

So far I've used the notebook locally without a graphics card, so it's all synthesized using Griffin-Lim. But I will also run it on my server later and see the difference with your pretrained WaveRNN.
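The speaker selection mentioned above (keeping only speakers with at least 100 sentences) can be sketched as follows. This is a minimal illustration, not the code used for the experiment: the function name, the `(speaker, text)` item format, and the contiguous re-indexing of speaker ids for an embedding table are assumptions.

```python
from collections import Counter


def filter_speakers(items, min_sentences=100):
    """Keep items from speakers with at least min_sentences utterances.

    items is a list of (speaker, text) pairs. Returns the filtered items with
    speakers replaced by contiguous integer ids, plus the speaker-to-id map.
    """
    counts = Counter(speaker for speaker, _ in items)
    kept = sorted(s for s, n in counts.items() if n >= min_sentences)
    # Contiguous ids so they can index rows of a speaker embedding table.
    speaker_to_id = {s: i for i, s in enumerate(kept)}
    filtered = [(speaker_to_id[s], text) for s, text in items if s in speaker_to_id]
    return filtered, speaker_to_id


# Toy data with a lower cutoff so the effect is visible.
items = [("anna", "eins"), ("anna", "zwei"), ("anna", "drei"), ("ben", "vier")]
filtered, speaker_to_id = filter_speakers(items, min_sentences=2)
print(speaker_to_id, len(filtered))
```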

erogol commented 5 years ago

@twerkmeister good job! For preliminary results, they are quite good.

erogol commented 5 years ago

I successfully trained a German model using the M-AILABS dataset. It is not perfect but works reasonably well. One limiting factor is that the dataset has voice talents impersonating book characters, and some of the recordings are badly aligned with their transcripts.

I am going to release the model soon and you can see the results below.

[Audio: voice example]

[Image: de_alingment]

twerkmeister commented 5 years ago

Good stuff! Did you use Tacotron 1 for this?

Also, as I noted before, the phonemizer doesn't properly handle German umlauts. In your example sentence it says Privatsphare instead of Privatsphäre. We might want to add a text cleaner for German that turns ä into ae, ü into ue, and ö into oe. I found that to work better, but it's still not perfect, and some grapheme-to-phoneme conversions will still be wrong.
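A cleaner along those lines could look like the sketch below. The function name is hypothetical and this is not an existing TTS cleaner; handling the uppercase umlauts and normalizing to NFC first (so that umlauts written with a combining diaeresis are also caught) are additions assumed here, beyond the three lowercase replacements mentioned above.

```python
import unicodedata

# Replacement table; the uppercase entries are an assumption beyond ä, ö, ü.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def german_umlaut_cleaner(text):
    """Transliterate German umlauts to their two-letter ASCII spellings."""
    # Compose "a" + combining diaeresis into a single "ä" before mapping.
    text = unicodedata.normalize("NFC", text)
    return text.translate(UMLAUT_MAP)


print(german_umlaut_cleaner("Privatsphäre"))
```

The character ß is left untouched here, since nothing above suggests it causes problems.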

erogol commented 5 years ago

@twerkmeister yes, it is T1.

I guess it is easier to train with graphemes in German, since spelling and pronunciation are closely aligned, unlike many English words.

davidak commented 4 years ago

> I am going to release the model soon and you can see the results below.

What is the status of the release?

erogol commented 4 years ago

We do not have a date set for a release, since we could not find a good enough dataset to go with.

thorstenMueller commented 4 years ago

I'm currently recording and training my own German TTS voice using mimic-recording-studio. So far I have read 11,000 phrases with a total length of more than 10 hours.

I published an intermediate result here (under a CC0 license): https://github.com/thorstenMueller/deep-learning-german-tts

I will upload new phrases (in LJSpeech structure) soon.

Maybe this helps.
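For readers unfamiliar with the LJSpeech structure: the original LJSpeech dataset ships a `wavs/` folder plus a `metadata.csv` in which each line holds a clip id and its transcription, separated by `|`, with a third field for the normalized transcription. A minimal parser might look like the sketch below. The function name and the sample line are made up, and whether the dataset linked above uses two or three fields per line is not stated here, so the sketch simply takes the last field as the text.

```python
def parse_ljspeech_metadata(lines, wav_dir="wavs"):
    """Parse LJSpeech-style metadata lines into (wav_path, text) pairs."""
    items = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("|")
        # First field is the clip id; the last field is the normalized text
        # when present, otherwise the raw transcription.
        clip_id, text = fields[0], fields[-1]
        items.append((f"{wav_dir}/{clip_id}.wav", text))
    return items


# Hypothetical metadata line for illustration.
sample = ["clip_0001|Hallo Welt.|Hallo Welt.\n"]
print(parse_ljspeech_metadata(sample))
```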