gooofy / zamia-speech

Open tools and data for cloudless automatic speech recognition
GNU Lesser General Public License v3.0

Further support for French #78

pguyot opened this issue 4 years ago (status: Open)

pguyot commented 4 years ago

This adds further support for French, which allowed me to build a reasonable model for French (tdnn_250; the tdnn_f model is still being built).

%WER 30.17 [ 42464 / 140755, 3870 ins, 12822 del, 25772 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_7_0.0

The WER is quite high (higher than previously reported), but this is probably because the previous model's WER was poorly computed (against a small set of test voices), while this new model is trained on a lot of noisy corpora.

pguyot commented 4 years ago

Resulting models are available here: https://github.com/pguyot/zamia-speech/releases/tag/20190930

joazoa commented 4 years ago

Thanks for the great work pguyot!

As mentioned in the other ticket, I have been trying to build the French models based on the zamia master branch. (I didn't see your most recent commits until last night. :/)

To clarify what I meant about the hanging est_republicain import: when I ran ./speech_sentences.py est_republicain, the script seemed to hang (according to strace it doesn't actually hang, it just runs really slowly). I left it for a couple of hours before giving up.

On my setup, after running `xmllint --xpath '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' Annee*/*.xml | perl -pe 's/^ +//g ; s/^ (.+)/$1\n/g ; chomp' > est_republicain.txt`, the result is a 1.8 GB file without newlines.

I didn't spend much time on it; I used sed 's/. /.\n/g' to add some newlines and that seemed to do the trick for now.
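If sed struggles with a single 1.8 GB line, a rough Python equivalent that streams the file in chunks would look like this (file names are placeholders, and the ". " heuristic is as crude as the sed version, e.g. it also splits on abbreviations):

```python
def split_sentences(src, dst, bufsize=1 << 20):
    """Stream src in chunks, inserting a newline after each '. '."""
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        carry = ""
        while True:
            chunk = fin.read(bufsize)
            if not chunk:
                fout.write(carry)
                break
            out = (carry + chunk).replace(". ", ".\n")
            carry = out[-1]  # a trailing "." may pair with a " " in the next chunk
            fout.write(out[:-1])

split_sentences("est_republicain.txt", "est_republicain_split.txt")
```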

I got around the quality issue by changing the audio scan script to set the quality to 2 and make ts a lowercase version of the prompt.
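In pseudo-Python, the tweak was roughly this (the field names are illustrative, not the scan script's actual data structures):

```python
# Illustrative only; the real audio scan script's internals differ.
prompt = "Bonjour tout le monde."  # hypothetical prompt from the scan

entry = {
    "prompt": prompt,
    "quality": 2,           # mark the utterance as reviewed/usable
    "ts": prompt.lower(),   # lowercased prompt as the transcript
}
print(entry)
```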

The .ipa I was referring to is the French dictionary that comes with Zamia; it is missing pronunciations for these words: hypnoanalgésies, ibogaïne, ibogaïnes, malabéen, malabéenne, malabéennes, malabéens, patagonien, patagonienne, patagoniennes, patagoniens, sulfamidé, sulfamidée, sulfamidées, sulfamidés, théophanique, théophaniques, xavière, xavières

For now I just deleted those lines, as I'm first trying to get it to actually train before doing things more properly. I just saw you committed some sources a week ago that I had not seen yet, including a newer dictionary with a different encoding; I will be sure to give that a try.
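Concretely, the deletion was just a filter along these lines (the dict-fr.ipa file name and the word-before-";" line layout are my assumptions about the .ipa format):

```python
# Throwaway filter in the spirit of "just delete those lines".
missing = {
    "hypnoanalgésies", "ibogaïne", "ibogaïnes", "malabéen", "malabéenne",
    "malabéennes", "malabéens", "patagonien", "patagonienne",
    "patagoniennes", "patagoniens", "sulfamidé", "sulfamidée",
    "sulfamidées", "sulfamidés", "théophanique", "théophaniques",
    "xavière", "xavières",
}

with open("dict-fr.ipa", encoding="utf-8") as fin, \
     open("dict-fr.filtered.ipa", "w", encoding="utf-8") as fout:
    for line in fin:
        # assumed layout: the headword comes before the first separator
        word = line.split(";", 1)[0].strip().lower()
        if word not in missing:
            fout.write(line)
```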

The next issue I encountered was the import_cv_fr script returning "uttid %s ist not unique!". I'm not at that PC at the moment, but I believe I was able to patch it somehow to make it work, although my resulting file is not the same as yours.
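From memory, the idea was something like this (a guess at a fix, not the actual patch):

```python
from collections import Counter

# Make colliding utterance ids unique by appending a running counter.
seen = Counter()

def unique_uttid(uttid):
    seen[uttid] += 1
    n = seen[uttid]
    return uttid if n == 1 else "%s_%d" % (uttid, n)

print(unique_uttid("cv_fr_0001"))  # cv_fr_0001
print(unique_uttid("cv_fr_0001"))  # cv_fr_0001_2
```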

I did not fix this yet: "create one dir per utt (since we have no speaker information)", as the cv_fr files do seem to contain speaker information. My cv_fr, voxforge_fr and m_ailabs_fr spk_test.txt files are all empty, which I presume may also be the reason why I have no test samples when I start the training (and why the training fails).

Are you generating those files manually, or do you use GenerateCorpora or some other way to split into train/dev/test?

I saw you also wrote import scripts for some more corpora; that was going to be my plan for the coming week, so you already saved me quite some time there!

I may have some more text and audio corpora to add to the list.

pguyot commented 4 years ago

I didn't spend much time on it; I used sed 's/. /.\n/g' to add some newlines and that seemed to do the trick for now.

Could you please suggest a precise revision of the README.md file to explain the sed trick? To be completely transparent, I did not actually use the xmllint line myself; I cooked it up as a replacement for a more complex combination of scripts and manual text editing that relied on a lot of third-party dependencies and seemed overkill.

I got around the quality issue by changing the audio scan script to set the quality to 2 and make ts a lowercase version of the prompt.

Umm. I realize I invoke speech_sentences.py with the -p option. Is that the step at which you need to add the quality to the CSV transcripts? We could simply document the -p case in README.md.

I believe the .ipa file in this pull request does not have these errors.

Likewise, I fixed the spk_test.txt files in this pull request, which is why the WER increased compared to my previous attempt; yet the error rates on real-world audio decreased significantly (it's not perfect, but it does look like it is recognizing something).

To generate the spk_test.txt files, I found out that Guenter had used a specific proportion of speakers for testing (5% if I remember correctly), and I picked them randomly.
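Concretely, the selection was nothing more sophisticated than this sketch (reconstructed for illustration; the speaker list here is fabricated, and the one-speaker-per-line layout of spk_test.txt is an assumption):

```python
import random

def pick_test_speakers(speakers, fraction=0.05, seed=20190930):
    """Randomly hold out a fixed proportion of speakers for the test set."""
    k = max(1, int(len(speakers) * fraction))
    return sorted(random.Random(seed).sample(sorted(speakers), k))

# speakers would normally come from the corpus metadata
speakers = ["spk%03d" % i for i in range(200)]
with open("spk_test.txt", "w", encoding="utf-8") as f:
    for spk in pick_test_speakers(speakers):
        f.write(spk + "\n")
```

Splitting by speaker rather than by utterance matters here: if the same voice appears in both train and test, the WER looks artificially good.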

Please be aware of the licenses of the audio and text corpora. The corpora I added are available under a CC-BY-NC-SA license, which is fine for me but represents a stronger constraint compared to Mozilla CV.

pguyot commented 4 years ago

Finally finished the tdnn_f model.

%WER 25.40 [ 35749 / 140755, 3408 ins, 11742 del, 20599 sub ] exp/nnet3_chain/tdnn_f/decode_test/wer_7_0.0

Downloadable from the same link: https://github.com/pguyot/zamia-speech/releases/tag/20190930

a-rose commented 3 years ago

Hello,

I'm bumping this because I'm interested in building French models. I managed to build a small model using this branch and a subset of the dataset. Are there any outstanding issues preventing this from getting merged that I could look into?