gooofy / zamia-speech

Open tools and data for cloudless automatic speech recognition
GNU Lesser General Public License v3.0

Training wav2letter++ streaming convnets (TDS + CTC) #101

Open · erksch opened 4 years ago

erksch commented 4 years ago

Hey!

First of all, I think your work is amazing and making all your models available is just so generous.

I checked out your German wav2letter model, and as far as I can tell from your training config (w2l_config_conv_glu_train.cfg), the acoustic model is based on conv_glu with the ASG criterion from the original wav2letter paper.

Facebook released its streaming_convnets recipe in January, which allows online speech recognition with streaming capability, and I would kill for a German model of that kind. Here is a link to the architecture file and the training config.
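For anyone following along: as far as I understand, wav2letter++ training just means pointing the Train binary at a gflags-style flags file. A minimal launcher sketch (the build path and recipe file name are my guesses, not verified):

```python
# Sketch of launching wav2letter++ training. TRAIN_BIN and FLAGSFILE are
# placeholders: point them at your own build and at the streaming_convnets
# config you downloaded.
import subprocess

TRAIN_BIN = "wav2letter/build/Train"  # Train binary from the wav2letter++ build
FLAGSFILE = "recipes/streaming_convnets/librispeech/train_am.cfg"  # hypothetical path

# wav2letter++ reads all training flags (arch file, data lists, criterion,
# learning rate, ...) from the flags file; for multi-GPU runs it also supports
# an MPI-based distributed mode (mpirun ... --enable_distributed=true).
subprocess.run([TRAIN_BIN, "train", f"--flagsfile={FLAGSFILE}"], check=True)
```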

I want to train the acoustic model with the hardware resources I have available and with updated German speech corpora (such as the most recent Common Voice release with its 500 hours of German speech).
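For the Common Voice part, my rough plan for turning a release into wav2letter's list format looks like the sketch below. The four-column layout (id, path, duration, transcript) is my reading of the recipes, so double-check it against your wav2letter version:

```python
# Sketch: build a wav2letter list file from a Common Voice release. Assumes
# the mp3 clips were already converted to 16 kHz mono wav (e.g. with ffmpeg),
# and that the list format is "id path duration transcript" per line, as in
# the wav2letter recipes; double-check both assumptions.
import csv
import wave
from pathlib import Path

CV_DIR = Path("cv-corpus-de")       # hypothetical location of the German release
WAV_DIR = CV_DIR / "clips_wav"      # converted audio lives here

with open(CV_DIR / "validated.tsv", newline="", encoding="utf-8") as tsv, \
     open("train.lst", "w", encoding="utf-8") as out:
    for i, row in enumerate(csv.DictReader(tsv, delimiter="\t")):
        wav_path = WAV_DIR / Path(row["path"]).with_suffix(".wav").name
        if not wav_path.exists():
            continue                # skip clips that failed conversion
        with wave.open(str(wav_path)) as w:
            dur_ms = 1000.0 * w.getnframes() / w.getframerate()
        text = row["sentence"].strip().lower()
        out.write(f"cv-{i}\t{wav_path}\t{dur_ms:.1f}\t{text}\n")
```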

Regarding your experience in training a wav2letter model:

- How many and what GPUs do you use for training? (The wav2letter guys said here that they used 32 GPUs for training the streaming convnets acoustic model, which sounds a little bit insane.)
- How much RAM does the system need to have, or is it primarily GPU work?
- How long did the training of your wav2letter model take?
- Are there any pitfalls when training with wav2letter?

Many thanks :)

gooofy commented 4 years ago

Hi Erik,

> First of all, I think your work is amazing and making all your models available is just so generous.

Thank you :)

> How many and what GPUs do you use for training? (The wav2letter guys said here that they used 32 GPUs for training the streaming convnets acoustic model, which sounds a little bit insane.)

I used a single 1080 Ti.

> How much RAM does the system need to have, or is it primarily GPU work?

Not sure how much is actually needed; the system I used has 64 GB of RAM.

> How long did the training of your wav2letter model take?

My memory may be off here, but I think some 4-6 months.

> Are there any pitfalls when training with wav2letter?

Well, as with most models, the language model has a high impact on the final WER results. I also remember the code wasn't as robust as Kaldi back then, but I guess that should have improved by now.

Good luck with training your model! :)

guenter

erksch commented 4 years ago

Thank you for your reply and the insights! That's a lot of time :sweat_smile:

Since you mention it, what about your language models? How long did it take, say, for the large order-6 German LM?

And if I have domain-specific words that I really want my speech recognition to know about, should I add examples of them to the speech corpora, or make sure those words are well represented in the language model text corpora? Or both? Or should the language model text corpus be identical to the speech corpus transcripts?

Sorry for all the questions :D

svenha commented 4 years ago

Hi Erik.

Your project sounds interesting. I have just one remark about annotation quality, because this problem came up several times in this project, and Guenter spent a lot of time correcting annotation problems in speech corpora. So if you include the latest Common Voice data set, which is still very new, I would be cautious and try to spot problematic audio files and/or annotations.
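A cheap first pass (a heuristic of mine, not existing Zamia tooling) is to flag clips whose duration does not plausibly match the transcript length and listen to those manually; a sketch in Python:

```python
# Heuristic screening for misaligned clips: German speech runs very roughly
# 10-20 characters per second, so a chars-per-second value far outside a
# generous band is worth a manual listen. The thresholds are guesses.
def suspicious(duration_s: float, transcript: str,
               min_cps: float = 5.0, max_cps: float = 30.0) -> bool:
    """Return True if the clip's chars-per-second looks implausible."""
    n_chars = len(transcript.replace(" ", ""))
    if duration_s <= 0 or n_chars == 0:
        return True
    cps = n_chars / duration_s
    return not (min_cps <= cps <= max_cps)

# A 1-second clip with a 60-character sentence is almost certainly misaligned.
assert suspicious(1.0, "x" * 60)
```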

Just curious: What WERs are you expecting?

Sven

erksch commented 4 years ago

@svenha you're right. In theory, the Common Voice data set has already been reviewed by its users, but I don't know whether that actually guarantees the data's quality.

Regarding WER, I'm not expecting anything in particular. I'll compare it against a microphone streaming implementation backed by a Kaldi model (the German Zamia Kaldi model) and see which feels better and more robust.
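For that comparison, a plain word-level edit distance is enough to score both systems; a minimal self-contained sketch:

```python
# Word error rate via the standard word-level edit distance
# (substitutions + insertions + deletions, divided by reference length).
def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / max(len(r), 1)

print(wer("heute scheint die sonne", "heute scheint sonne"))  # 0.25
```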

gooofy commented 4 years ago

Hi Erik,

> Since you mention it, what about your language models? How long did it take, say, for the large order-6 German LM?

I don't remember exactly, but not very long; maybe a few days at most.
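In case it helps: the wav2letter++ decoder consumes KenLM language models, so building an order-6 LM looks roughly like this (the exact flags are from memory, treat them as an assumption and double-check):

```python
# Sketch of building an order-6 LM with KenLM (lmplz + build_binary), which is
# the LM format the wav2letter++ decoder consumes. Corpus/output names are
# placeholders and the memory/temp flags are just sensible defaults.
import subprocess

# lmplz expects a normalized, one-sentence-per-line text corpus;
# -S caps memory use and -T sets the temp dir for the on-disk sort.
subprocess.run(
    ["lmplz", "-o", "6", "-S", "40%", "-T", "/tmp",
     "--text", "lm_corpus_de.txt", "--arpa", "lm_de_o6.arpa"],
    check=True,
)

# Convert the ARPA file to KenLM's binary format for fast loading at decode time.
subprocess.run(["build_binary", "lm_de_o6.arpa", "lm_de_o6.bin"], check=True)
```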

> And if I have domain-specific words that I really want my speech recognition to know about, should I add examples of them to the speech corpora, or make sure those words are well represented in the language model text corpora? Or both?

More data is always good :) Ideally, you want recordings of all those domain-specific words in multiple contexts, by multiple speakers, using different microphones, environments, etc. And of course your language model should cover these words in realistic contexts as well.
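Before training, it's also worth checking that your domain words actually occur (and occur often enough) in the LM text corpus. A quick hypothetical helper:

```python
# Hypothetical helper: count how often each domain word occurs in the LM text
# corpus, so you notice missing or underrepresented terms before training.
from collections import Counter

DOMAIN_WORDS = {"bahnsteig", "weichenstellung", "oberleitung"}  # your jargon here

counts = Counter()
with open("lm_corpus_de.txt", encoding="utf-8") as f:
    for line in f:
        for tok in line.lower().split():
            if tok in DOMAIN_WORDS:
                counts[tok] += 1

for word in sorted(DOMAIN_WORDS):
    print(f"{word}: {counts[word]} occurrences")
```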

cheers,

guenter

lagidigu commented 4 years ago

@erksch did you have success training a streaming convnet on the Mozilla dataset? I'm attempting something similar.