Closed by erogol 5 years ago
Hey Eren, the link seems to be broken and now the dataset resides here https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/
Looking into the m-ailabs dataset right now. Looks like we have 5 different speakers. One male and four female. Also global style tokens should be of use here to capture prosody and train the system. The reading can be quite expressive:
https://vocaroo.com/i/s1nl3RWZrGQO vs. https://vocaroo.com/i/s1xlM1WWRHx8
Also numbers aren't normalized in those files 😏
@twerkmeister TTS normalizes numbers by default, unless the input is very unusual. The MAILABS dataset looks like a good but hard start, since the readings have very large prosody changes within a session. That makes it hard to model with vanilla TTS, but using style tokens makes sense to capture these fluctuations.
Yeah doing some tests with the phonemizer right now. Numbers and even things like "G20" work quite well. The things it messes up are definitely dates, ordinals, and some abbreviations:
echo '20.10.2019' | phonemize -l de -b espeak
tsvantsɪç pʊnkt tseːn pʊnkt tsvaɪ taʊzənt nɔøntseːn
echo 'der 60. Geburtstag' | phonemize -l de -b espeak
dɛɾ zɛçtsɪç ɡəbʊɐtstɑːk
echo '1,3 Mrd.' | phonemize -l de -b espeak
aɪns kɔma dɾaɪ ɛmɛɾdeː
But that might be negligible. Gonna run a test on the Angela Merkel subset in that dataset and see what it produces. Also, gonna implement simple multi-speaker support and run it on the German Common Voice corpus.
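For the abbreviation case above ('Mrd.' being spelled out letter by letter), one option is to expand known abbreviations before the text reaches the phonemizer. The following is only a minimal sketch: the function name and the lookup table are hypothetical and not part of TTS or phonemizer, and dates and ordinals are not handled because they would need proper German inflection rules.

```python
import re

# Hypothetical lookup table for illustration; a real one would need many more entries.
_ABBREVIATIONS = {
    "Mrd.": "Milliarden",
    "Mio.": "Millionen",
}

# Match an abbreviation only when it is not glued to a preceding word character.
_ABBREVIATION_RE = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(a) for a in _ABBREVIATIONS) + r")"
)


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its spelled-out form."""
    return _ABBREVIATION_RE.sub(lambda m: _ABBREVIATIONS[m.group(1)], text)


print(expand_abbreviations("1,3 Mrd."))
```

Note that the abbreviation's own period is consumed, so a sentence that ends in an abbreviation loses its final punctuation in this sketch.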
Also noticing that the automatic sentence segmentation plus alignment aren't perfect. Sometimes first and last words of a sentence slip entirely or partially into neighboring audio files
Ok training seems to be running now, but had to set do_trim_silence to false, otherwise it would end up with a few empty audio files and crash.
You might need to set the silence threshold for do_trim..., but for many datasets it does not change things much.
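To illustrate why trimming can leave an empty clip, here is a simplified amplitude-threshold sketch. It is not how TTS implements do_trim_silence; the function names and the threshold value are assumptions made for illustration. If every sample falls below the threshold, nothing survives trimming, so the guard falls back to the untrimmed clip instead.

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose absolute amplitude is below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        # The whole clip is quieter than the threshold: trimming removes everything.
        return []
    return samples[loud[0]: loud[-1] + 1]


def safe_trim(samples, threshold=0.01):
    """Trim silence, but keep the original clip if trimming would leave it empty."""
    trimmed = trim_silence(samples, threshold)
    return trimmed if trimmed else samples


quiet_clip = [0.001, 0.002, 0.001]
print(len(trim_silence(quiet_clip)), len(safe_trim(quiet_clip)))
```

Lowering the threshold is the other way out: a very quiet recording then still has samples that count as non-silent.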
> Also noticing that the automatic sentence segmentation plus alignment aren't perfect. Sometimes first and last words of a sentence slip entirely or partially into neighboring audio files
Could you give me some examples of this? It is interesting!
I tried one of the German voices from M-AILABS some time ago and adapted it with a small set (600 sentences or so) using Rayhane-mamah's Tacotron implementation, and it worked pretty well.
It can be heard here https://kutinkindlinger.com/le-parleur-radiopiece-52/
The SoundCloud player is a bit tiny but it's there ;)
That's just Griffin Lim though, but for this art project it wasn't too bad a match.
My main point is that the pronunciation is pretty good when training from characters (of course it's a lot easier for German than English).
@erogol here's one example of the endings slipping into the next example. I didn't search for it; these were simply the first two files I listened to. The "Und" of the second sentence is in the first audio file.
Next example, I randomly picked: word order got seriously mixed up:
what's being said is
Soldatinnen oder aber auch der Männer, aber vieler Kinder auch höre, [...]
Checked some more: almost always the first words, like "Und", are missing, or more:
The beginning "Das ist" is missing (and found in the previous audio file ^^)
Merkel dataset quality doesn't seem to be great
@twerkmeister I have not seen this much confusion in alignment before. Are you sure something else is not the culprit? In general, if alignment does not work, the model only produces gibberish. It does not change the order of words. This is quite interesting.
So are these audios generated by TTS?
Are you sure the text given to the model is as you provided above?
Do you also have alignment plots?
If these are from TTS, how did you generate them? From the notebook, or did you write your own code?
This is the training dataset :D
I have some first results on the Common Voice German corpus with multi-speaker embeddings (trained with 265 speakers that had at least 100 sentences each). It's much better than what I had before, when we met. Interestingly, the training didn't go particularly well until step 200k or so, when it suddenly learned very good attention.
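The speaker selection described above (keeping only speakers with at least 100 sentences) could be sketched roughly as follows. This is an assumption about the preprocessing, not the code actually used: the function name and the (speaker_id, text) item layout are hypothetical.

```python
from collections import Counter


def speakers_with_min_utterances(items, min_count=100):
    """Keep only utterances from speakers with at least min_count utterances.

    items: iterable of (speaker_id, text) pairs (hypothetical layout).
    Returns the filtered list and a speaker -> embedding index mapping.
    """
    items = list(items)  # materialize, since the data is traversed twice
    counts = Counter(speaker for speaker, _ in items)
    kept = sorted(s for s, n in counts.items() if n >= min_count)
    speaker_to_index = {s: i for i, s in enumerate(kept)}
    filtered = [(s, t) for s, t in items if s in speaker_to_index]
    return filtered, speaker_to_index


# Tiny demonstration with a threshold of 2 instead of 100.
demo = [("a", "eins"), ("a", "zwei"), ("b", "drei")]
print(speakers_with_min_utterances(demo, min_count=2))
```

The resulting index mapping is what a speaker embedding table would be keyed on, so its size equals the number of speakers that pass the threshold.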
Here are a few synthesized examples of my fav German tongue twister: speaker 0 speaker 1 speaker 64 (female) speaker 188 (fairly clear)
So far I used the notebook locally without a graphics card, so it's all synthesized using Griffin-Lim. But I will also run it on my server later and see the difference with your pretrained WaveRNN.
@twerkmeister good job! For preliminary results, they are quite good.
I successfully trained a German Model using m-ai-labs dataset. Not perfect but works reasonably well. One limiting factor is that the dataset has voice talents impersonating book characters and some of the voice recordings have bad alignment between voice clips and transcripts.
I am going to release the model soon and you can see the results below.
Good stuff! Did you use tacotron 1 for this?
Also, as I noted before, the phonemizer doesn't properly handle German umlauts. In your example sentence it says Privatssphare instead of Privatssphäre. We might want to add a text cleaner for German that turns ä into ae, ü into ue, and ö into oe. I found that to work better, but it's still not perfect, and some grapheme-to-phoneme conversions will still be wrong.
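The proposed cleaner could look roughly like the sketch below. The function name and mapping table are hypothetical and not an existing TTS cleaner. The uppercase variants are an addition beyond the three replacements named above, and ß is deliberately left untouched.

```python
import unicodedata

# Mapping from the comment above, plus uppercase variants (an added assumption).
_UMLAUT_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def transliterate_umlauts(text: str) -> str:
    """Replace German umlauts with their two-letter ASCII spellings."""
    # Normalize first so that 'a' + combining diaeresis becomes a single 'ä'.
    text = unicodedata.normalize("NFC", text)
    return "".join(_UMLAUT_MAP.get(ch, ch) for ch in text)


print(transliterate_umlauts("Privatsphäre"))
```

Such a cleaner would run before phonemization, so the phonemizer only ever sees the transliterated spelling.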
@twerkmeister yes it is T1
I guess it is easier to train with graphemes in German, since spelling and pronunciation are closely aligned, unlike many English words.
> I am going to release the model soon and you can see the results below.
What is the status of the release?
We do not have a dataset for a release, since we could not find one that is good enough.
I'm currently recording and training my own German TTS voice using mimic-recording-studio. So far I have read 11,000 phrases with a total length of more than 10 hours.
I published an intermediate result here (under a CC0 license): https://github.com/thorstenMueller/deep-learning-german-tts
I will upload new phrases (ljspeech structure) soon.
Maybe this helps.
A dataset for research purposes can be found here: http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/
I expect better results, since German spelling corresponds more consistently to its pronunciation.