When comparing performance, keep in mind that Google trains its systems on thousands of hours of audio, while these models are trained on only hundreds.
We did some measurements on phone recordings a while ago with similar results. Our assessment was that the biggest problem is that real-life phone recordings are very different from the training material we have available, which consists mostly of speakers reading written sentences (i.e. no spontaneous, conversational speech).
That said, you could probably improve performance by providing shorter audio segments (training material is typically <12s per segment) and by adapting the language model to be closer to your target domain. However, this will probably still not get accuracy anywhere near models trained on actual phone recordings.
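For illustration only (not from the original discussion), here is a minimal sketch of building such a domain language model with KenLM's `lmplz`; the file names are placeholders, and a file of normalized in-domain transcripts, one utterance per line, is assumed:

```python
import subprocess

# lmplz ships with KenLM; "domain_transcripts.txt" (one normalized utterance per
# line) and "domain.arpa" are placeholder file names.
with open("domain_transcripts.txt", "rb") as inp, open("domain.arpa", "wb") as out:
    # build a 3-gram ARPA language model from the in-domain text
    subprocess.run(["lmplz", "-o", "3"], stdin=inp, stdout=out, check=True)

# the resulting ARPA file can then be compiled into a decoding graph
# (e.g. with Kaldi's arpa2fst / utils/format_lm.sh)
```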
> you could probably improve performance by providing shorter audio segments (training material is typically <12s per segment)
Thank you for this insight. My .wav files are typically 30 to 60 seconds long. What's the best way to segment an audio file so that utterances are not chunked awkwardly? Would I need to use some kind of voice activity detection (VAD)?
> However, this will probably still not get accuracy anywhere near models trained on actual phone recordings.
How laborious is it to train my own model? I have hundreds of thousands of transcriptions (by Google STT) with their corresponding .wav files. I was wondering what the simplest way is to build an ASR model with Kaldi (or perhaps build upon the pre-trained ASpIRE model) to improve accuracy? I'm quite unfamiliar with Kaldi.
> What's the best way to segment an audio file so that utterances are not chunked awkwardly? Would I need to use some kind of voice activity detection (VAD)?
You will also need to keep the transcript aligned with the segments, so VAD alone will not cut it here. There are Kaldi recipes available for auto-segmentation - our audiobook auto-segmentation tool is based on those, so you could use it as a starting point:
https://github.com/gooofy/zamia-speech#audiobook-segmentation-and-transcription-kaldi
However, if you're planning to train your own acoustic model anyway (see below), I doubt this is worth the effort, as you could train your own model on these longer segments right away.
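Just to make the VAD idea concrete, here is a minimal sketch using py-webrtcvad (an assumption on my part, not the tool the audiobook segmentation scripts use). It only yields speech regions; the transcript alignment mentioned above is still a separate step. `call.wav` is a placeholder and is assumed to be 16-bit mono PCM at a rate webrtcvad supports (8/16/32/48 kHz):

```python
import wave
import webrtcvad

vad = webrtcvad.Vad(2)   # aggressiveness: 0 (least strict) .. 3 (most strict)
frame_ms = 30            # webrtcvad accepts 10, 20 or 30 ms frames

with wave.open("call.wav", "rb") as wf:
    sample_rate = wf.getframerate()
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2   # 2 bytes per 16-bit sample
    pcm = wf.readframes(wf.getnframes())

segments, current = [], None
for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
    start = offset / 2.0 / sample_rate                     # frame start time in seconds
    if vad.is_speech(pcm[offset:offset + frame_bytes], sample_rate):
        # extend the current speech region or open a new one
        current = (current[0] if current else start, start + frame_ms / 1000.0)
    elif current:
        segments.append(current)
        current = None
if current:
    segments.append(current)

# list of (start_s, end_s) speech regions; transcripts still need to be re-aligned
print(segments)
```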
> How laborious is it to train my own model? I have hundreds of thousands of transcriptions (by Google STT) with their corresponding .wav files. I was wondering what the simplest way is to build an ASR model with Kaldi (or perhaps build upon the pre-trained ASpIRE model) to improve accuracy? I'm quite unfamiliar with Kaldi.
For traditional Kaldi recipes you will also need a pronunciation dictionary covering all the words in your transcripts. For starters you could use an existing dictionary and generate the missing entries using G2P (e.g. Sequitur).
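As a rough sketch of that G2P step (Sequitur's `g2p.py` is assumed to be installed; `model-6` and the word-list file names are placeholders):

```python
import subprocess

# g2p.py comes with Sequitur G2P; "model-6" is a previously trained G2P model and
# "oov_words.txt" lists one dictionary-missing word per line (placeholder names).
with open("oov_lexicon.txt", "w") as lex:
    subprocess.run(
        ["g2p.py", "--model", "model-6", "--apply", "oov_words.txt"],
        stdout=lex, check=True,
    )
# oov_lexicon.txt now holds word/pronunciation pairs to merge into the base dictionary
```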
If you want to use our tools, check out the README for instructions. If you want to use Kaldi directly, you can use one of their (preferably recent) recipes as a starting point for your own experiments.
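For orientation, here is a hedged sketch of the data preparation most Kaldi recipes start from: a `data/train` directory with `wav.scp`, `text` and `utt2spk`. The `corpus/` layout with one .wav plus one .txt transcript per utterance is an assumption, not something from this thread:

```python
from pathlib import Path

corpus, data = Path("corpus"), Path("data/train")
data.mkdir(parents=True, exist_ok=True)

with open(data / "wav.scp", "w") as wav_scp, \
     open(data / "text", "w") as text, \
     open(data / "utt2spk", "w") as utt2spk:
    for wav in sorted(corpus.glob("*.wav")):
        utt_id = wav.stem
        transcript = (corpus / f"{utt_id}.txt").read_text().strip().lower()
        wav_scp.write(f"{utt_id} {wav.resolve()}\n")     # utterance id -> audio path
        text.write(f"{utt_id} {transcript}\n")           # utterance id -> transcript
        utt2spk.write(f"{utt_id} {utt_id}\n")            # no speaker labels: one "speaker" per utterance

# afterwards, Kaldi's utils/utt2spk_to_spk2utt.pl and utils/fix_data_dir.sh
# produce the remaining files and sanity-check the directory
```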
Other than that, you could also use an end-to-end ASR engine like wav2letter or DeepSpeech.
I have a question regarding the performance (in terms of accuracy) of the Kaldi models (specifically kaldi-generic-en-tdnn_f) versus other well-known engines from Google and Watson when doing speech-to-text. From my preliminary testing (on telephony data), the Kaldi models are not nearly as accurate as Google's or Watson's. My typical audio file is ~30-60 seconds long.
Since I'm not sure how this model was trained, should I chunk the audio file into multiple sentence-length files to improve performance? Has anyone had good luck with Kaldi in terms of accuracy compared to Google and Watson?