Is there a tutorial for creating a homemade model from scratch?

elpimous commented 6 years ago

Hi all. This project is very interesting, and I'd like to create a French voice for Ubuntu (Ubuntu has very few French male TTS voices).

How many WAV sentences do I need to create a decent voice?
What is the time ratio between the text request and the spoken answer? (Real time? I'm working on an NVIDIA TX2 board.)
Could anyone write a small tutorial to help with creating a model?

PS: I'm using DeepSpeech for some French STT models, but TTS is missing!

Thanks all

Vincent

actarus33 commented 6 years ago

Hi Vincent, we are looking for the same thing, and I'm French too.

To create a decent voice, we need at least 20 hours of recorded audio with transcriptions, and the audio quality is very important (garbage in, garbage out). I'm far from that goal. I've started thinking about segmenting audiobooks (LibriVox, https://librivox.org/search?primary_key=2&search_category=language&search_page=1&search_form=get_results) to create a French corpus. There's also VoxForge, where we can access several hours of transcribed French audio (but I have doubts about the quality). My empire for data!
I am training on the default LJSpeech-1.0 dataset (English): 70K steps so far on an NVIDIA GTX 1070 (0.8 sec/step). I already get nice results when I test some of the checkpoints. Using the GPU, the spoken answer is almost real time (less than 1 second). With the CPU, that's another story...
To create a model, we first need data, a huge quantity of it. That's the most important part. After that, we have to train it.
Do you get good results from DeepSpeech with a French STT model? Can you share the model you used? I'm looking for a good one.
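
As a rough way to check a corpus against the 20-hour figure, and to put the step timing above in perspective, here is a minimal sketch using only Python's standard `wave` module. The helper names and the assumption of a flat directory of PCM `.wav` files are mine, for illustration only; nothing here comes from the tacotron repo.

```python
import wave
from pathlib import Path


def wav_seconds(path):
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def corpus_hours(wav_dir):
    """Sum the duration of every .wav file directly inside wav_dir, in hours."""
    total_seconds = sum(wav_seconds(p) for p in sorted(Path(wav_dir).glob("*.wav")))
    return total_seconds / 3600.0


def training_hours(steps, seconds_per_step):
    """Convert a step count and a per-step time into wall-clock hours."""
    return steps * seconds_per_step / 3600.0
```

With the numbers quoted above, `training_hours(70_000, 0.8)` comes to roughly 15.6 hours of pure step time.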

elpimous commented 6 years ago

Hi Actarus33, Bordeaux? Thanks for the explanations. 20 hours? That seems like a lot of material for a single voice! I understand about the audio quality. Thanks for the timing tests.

About DeepSpeech: I have very good results, but with a limited model. I work on specific models, with sentences restricted to fit a vocabulary file, for robotics use. The model is trained only on my own voice, nearly 5 hours, so inference only works on my voice. There are two databases of French material: VoxForge and corpus-ted-lium. I'll have a look at LibriVox.

I'm interested in your French material to create a good TTS (my objective). How many hours do you have? My email: elpimous12@gmail.com

actarus33 commented 6 years ago

Yes, I'm from Bordeaux. And you?

I've made significant progress since earlier, because I now have a very good French speech corpus recorded in a studio by a professional :-) Segmented audio with transcriptions! THE HOLY GRAIL! About 12 hours of segmented, high-quality audio from a nice female voice. I'm confident it will be sufficient because the quality is insane.

OK for the DeepSpeech model, I understand. I also love robotics ;-)

I will start training on the French corpus in a few hours and see what I can get from it.

actarus33 commented 6 years ago

@elpimous I will keep you informed, and I will share the results if I get something nice.

elpimous commented 6 years ago

Hi Actarus33. OK, thanks a lot. My single-speaker DeepSpeech model has a 0.36% WER (99.64% accuracy, enormous!)
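
For readers unfamiliar with the metric: word error rate is usually computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words, and the "accuracy" quoted above is simply 100% minus that. A minimal sketch follows; it assumes whitespace tokenization, and the function name is made up for illustration (this is not DeepSpeech's own scoring code).

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)
```

One substituted word in a three-word reference gives a WER of 1/3, so a 0.36% WER means well under one wrong word per hundred.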

I'm working on the LIUM corpus for my next French STT model. I'll try a Tacotron model too, of course.

My small Jetson TX2 board has a good GPU but a weak CPU. I hope I'll be able to train a voice...

gloriouskilka commented 6 years ago

Hello!

I have a small dataset collected manually from YouTube videos. The speaker's speech speed (spe spe spe) varies from slow to suddenly very fast. I split the speech at these speed change points, and I ended up with a lot of short word sequences and very few whole sentences.

My idea is to 'synthetically' extend the dataset with copies of the WAVs that have a slightly different speed than the originals.

Has anyone tried this approach before? Is it a good idea to try?
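
To make the idea concrete, one simple way to produce such copies is to resample the waveform by linear interpolation. This is only a sketch under my own assumptions: the function name is hypothetical, samples are a plain list of floats, and this naive method shifts the pitch along with the speed. Whether it actually helps Tacotron training is exactly the open question above.

```python
def change_speed(samples, factor):
    """Return a copy of `samples` that plays `factor` times faster.

    Uses linear interpolation, so pitch changes together with speed.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n = len(samples)
    if n == 0:
        return []
    new_length = max(1, round(n / factor))
    if new_length == 1:
        return [float(samples[0])]

    # Distance, in original samples, between two consecutive output samples.
    step = (n - 1) / (new_length - 1)
    output = []
    for i in range(new_length):
        position = i * step
        left = min(int(position), n - 1)
        right = min(left + 1, n - 1)
        fraction = position - left
        output.append(samples[left] * (1 - fraction) + samples[right] * fraction)
    return output
```

For example, `change_speed(clip, 1.05)` and `change_speed(clip, 0.95)` would give two slightly faster and slower variants of the same clip, both reusing the original transcript.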

JoranDox commented 6 years ago

What the author did was take one or more audiobooks from https://librivox.org/ by the same speaker and cut them into many short parts to get a lot of input (LJ-Speech). I did the same with Dutch, and while I don't have conclusive results yet, it seems to be aligning (at around 30k+ iterations, so slower, but still...).
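
For reference, LJSpeech ships a `wavs/` folder plus a `metadata.csv` in which each line is `clip id|transcription|normalized transcription`. A minimal sketch of writing that kind of file for your own clips is below; the function names and clip ids are made up for illustration, and reusing the raw text as the normalized text is just a placeholder assumption.

```python
from pathlib import Path


def metadata_line(clip_id, text, normalized_text=None):
    """Build one pipe-separated line in the LJSpeech metadata layout."""
    if normalized_text is None:
        normalized_text = text  # placeholder: no number/abbreviation expansion
    fields = [clip_id, text.strip(), normalized_text.strip()]
    for field in fields:
        if "|" in field or "\n" in field:
            raise ValueError("fields must not contain '|' or newlines")
    return "|".join(fields)


def write_metadata(path, rows):
    """Write rows of (clip_id, text) or (clip_id, text, normalized_text)."""
    lines = [metadata_line(*row) for row in rows]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Writing the file explicitly as UTF-8 matters as soon as the text contains accented characters.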

actarus33 commented 6 years ago

In my opinion, we have to work with a very high quality corpus. The results depend on it. I have tried segmented audiobooks without success.

The best results I got were with a professional, studio-quality dataset, a 12-hour French corpus. You can listen to my results on this page:

http://www.x90x90x90.com/deeplearning-mise-en-oeuvre-du-modele-tacotron-pour-la-synthese-de-la-parole-en-francais

entn-at commented 6 years ago

@actarus33 Did you use the SIWIS database?

actarus33 commented 6 years ago

Yes, I did use it. I also trained on other corpora: segmented audiobooks and a corpus generated from YouTube videos.

JpEncausse commented 6 years ago

Hi @actarus33 @elpimous, I'm also French and would be very interested in the state of your work. Do you have a repo?

I was wondering if Azure Video Indexer could help with creating an annotated corpus? You can feed it a podcast to separate the speakers and get speech-to-text. (I think Cedric Bonnet (GeekInc) would be so happy to have a digitized voice.)

brebetez commented 6 years ago

Hello @actarus33 & @JpEncausse, I'm Swiss and need to make a French voice. Unfortunately, I haven't managed to configure things correctly to get a good voice, and I have some questions. Maybe you can help me.

How did you handle the metadata? Did you keep the special characters or encode them (for example: La rÃ©solution votÃ©e OR La résolution votée)?
How did you configure the file "symbols.py" in the text folder? With "Ã©" or "é"?
Which cleaners did you use in hparams?
@actarus33: What is the status of your attempt with Tacotron 2? Is it already usable?

By the way, @actarus33, great article on your blog!

Thank you for your support and best regards
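
On the "Ã©" versus "é" question: text like "rÃ©solution" is typically what UTF-8 bytes look like after being decoded as Windows-1252/Latin-1, so it is usually a file-encoding accident rather than a choice to make. The sketch below shows one way to undo it and to list which characters a symbol set would be missing. It assumes that particular mis-decoding happened; both function names are hypothetical, and `symbols` is only a stand-in for whatever character set your `symbols.py` defines.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Windows-1252.

    Text that does not fit that pattern is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def missing_characters(lines, symbols):
    """Return the sorted characters used in `lines` but absent from `symbols`."""
    used = {character for line in lines for character in line}
    return sorted(used - set(symbols))
```

Running `missing_characters` over the transcript column of the metadata is a quick way to see which accented letters would need to be in the symbol set before training.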

jjerphan commented 5 years ago

Bonjour @actarus33, @elpimous, @JpEncausse, @brebetez,

I am also French and interested in getting access to a trained model and, if possible, in contributing to this project.

Thanks ! 🤸

Bobb10 commented 5 years ago

I am interested in building and training such a model, but with a Dutch voice instead of a French one. Does anyone have an idea how to approach this?

JoranDox commented 5 years ago

@Bobb10 I've tried with Dutch (cleaning & cutting up an audiobook, training for a bajillion iterations, etc.), and the best results came from starting with the pre-trained English voice and then training for a few iterations (I can't remember exactly how many, but less than 10% of the total) on the Dutch voice.

This worked relatively well (it sounded, as expected, like a mix between a robot, the female English voice and the male Dutch voice I used), apart from the fact that you need to train with the same symbols as the English version did (i.e. only the 26 regular letters, and nothing like éèàö).

I never got any really usable results by training from scratch IIRC (it's been a while).

Hope this helps someone.
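
If you go the fine-tuning route and have to stay within the English symbol set, the transcripts need their accents removed first. A minimal sketch with the standard `unicodedata` module is below. The function name is made up, and the approach is my assumption rather than what was used in the experiment above; it silently drops characters that have no ASCII base letter (such as "œ"), and it erases distinctions like "één" versus "een".

```python
import unicodedata


def to_ascii(text):
    """Strip accents so the text only uses unaccented ASCII characters."""
    # NFKD splits "é" into "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        character for character in decomposed
        if not unicodedata.combining(character)
    )
    # Drop anything that still is not ASCII (no base-letter equivalent).
    return without_marks.encode("ascii", "ignore").decode("ascii")
```

For instance, `to_ascii("éèàö")` gives `"eeao"`.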

Bobb10 commented 5 years ago

@JoranDox Thanks for the comment. My idea is to replace the US English dataset (speech from a single professional female speaker) with a Dutch dataset of speech from a single professional male speaker (dataset still under construction). Could you please send an email to bob_goertz@hotmail.com with details about your attempt as described here? Maybe we can help each other?

JoranDox commented 5 years ago

I don't know if I have much left to share: at the point where it started to become okay, Google started to upgrade and roll out many more voices (including Dutch), not to mention Amazon Polly and other text-to-speech generators that were becoming increasingly good. I've deleted the datasets and archived most of my code, since I've been working on different (totally unrelated) projects since then. On top of that, my code is technically the property of my company, so I'm not sure how much I could share here even if I found it again.

That said, most of the work went into creating the dataset (cutting the audiobook, cleaning the text that came with it, and updating the text to modern Dutch), but I believe I didn't have clean enough sound data, and/or Dutch is inherently harder, or something.

japita-se commented 5 years ago

Hi guys! Is there a way to download an audiobook from LibriVox and cut it into small clips to produce something similar to LJSpeech? I'm trying to do it for Italian...
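
One common starting point for the cutting step is to split wherever the recording stays quiet for a while. The sketch below does that on a plain list of float samples in the range -1.0 to 1.0. The function name, the amplitude threshold and the durations are all assumptions for illustration; nobody in this thread confirmed that LJSpeech was cut this way, and matching each clip to its text is a separate job.

```python
def split_on_silence(samples, sample_rate, threshold=0.02,
                     min_silence_s=0.5, min_clip_s=1.0):
    """Return (start, end) sample indices of clips separated by silence.

    A clip ends once the signal stays below `threshold` for at least
    `min_silence_s` seconds; clips shorter than `min_clip_s` are dropped.
    """
    min_silence = int(min_silence_s * sample_rate)
    min_clip = int(min_clip_s * sample_rate)
    clips = []
    clip_start = None  # index where the current clip began
    last_loud = None   # index of the most recent loud sample

    for index, value in enumerate(samples):
        if abs(value) >= threshold:
            if clip_start is None:
                clip_start = index
            last_loud = index
        elif clip_start is not None and index - last_loud >= min_silence:
            # Enough silence: close the clip right after the last loud sample.
            if last_loud + 1 - clip_start >= min_clip:
                clips.append((clip_start, last_loud + 1))
            clip_start = None

    # Close a clip that runs until the end of the recording.
    if clip_start is not None and last_loud + 1 - clip_start >= min_clip:
        clips.append((clip_start, last_loud + 1))
    return clips
```

Each returned pair can then be used to slice the sample list and save the slice as its own WAV file, after which the clips still need to be checked by ear against the book text.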

keithito / tacotron

Is there a tuto, to create from scratch, a homemade model ? #89