When comparing performance, keep in mind that Google trains its systems on thousands of hours of audio, while these models are trained on only hundreds.
We did some measurements on phone recordings a while ago with similar results. Our assessment was that the biggest problem is that real-life phone recordings are very different from the training material we have available, which consists mostly of speakers reading written sentences (i.e. no spontaneous, conversational speech).
That said, you could probably improve performance by providing shorter audio segments (training material is typically <12s per segment) and by adapting the language model to be closer to your target domain. However, this will probably still not get accuracy anywhere near models trained on actual phone recordings.
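For illustration only (not from the original discussion), here is a minimal sketch of building such a domain language model with KenLM's `lmplz`; the file names are placeholders, and a file of normalized in-domain transcripts, one utterance per line, is assumed:

```python
import subprocess

# lmplz ships with KenLM; "domain_transcripts.txt" (one normalized utterance per
# line) and "domain.arpa" are placeholder file names.
with open("domain_transcripts.txt", "rb") as inp, open("domain.arpa", "wb") as out:
    # build a 3-gram ARPA language model from the in-domain text
    subprocess.run(["lmplz", "-o", "3"], stdin=inp, stdout=out, check=True)

# the resulting ARPA file can then be compiled into a decoding graph
# (e.g. with Kaldi's arpa2fst / utils/format_lm.sh)
```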
> you could probably improve performance by providing shorter audio segments (training material is typically <12s per segment)
Thank you for this insight. My .wav files are typically 30 to 60 seconds long. What's the best way to segment an audio file so that utterances are not chunked awkwardly? Would I need to use some kind of voice activity detection (VAD)?
> However, this will probably still not get accuracy anywhere near models trained on actual phone recordings.
How laborious is it to train my own model? I have hundreds of thousands of transcriptions (by Google STT) with their corresponding .wav files. I was wondering what the simplest way is to build an ASR model with Kaldi (or perhaps build upon the pre-trained ASpIRE model) to improve accuracy? I'm quite unfamiliar with Kaldi.
> What's the best way to segment an audio file so that utterances are not chunked awkwardly? Would I need to use some kind of voice activity detection (VAD)?
You will also need to keep the transcript aligned with the segments, so VAD alone will not cut it here. There are Kaldi recipes available for auto-segmentation - our audiobook auto-segmentation tool is based on those, so you could use it as a starting point:
https://github.com/gooofy/zamia-speech#audiobook-segmentation-and-transcription-kaldi
However, if you're planning to train your own acoustic model anyway (see below), I doubt this is worth the effort, as you could train your own model on these longer segments right away.
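Just to make the VAD idea concrete, here is a minimal sketch using py-webrtcvad (an assumption on my part, not the tool the audiobook segmentation scripts use). It only yields speech regions; the transcript alignment mentioned above is still a separate step. `call.wav` is a placeholder and is assumed to be 16-bit mono PCM at a rate webrtcvad supports (8/16/32/48 kHz):

```python
import wave
import webrtcvad

vad = webrtcvad.Vad(2)   # aggressiveness: 0 (least strict) .. 3 (most strict)
frame_ms = 30            # webrtcvad accepts 10, 20 or 30 ms frames

with wave.open("call.wav", "rb") as wf:
    sample_rate = wf.getframerate()
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2   # 2 bytes per 16-bit sample
    pcm = wf.readframes(wf.getnframes())

segments, current = [], None
for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
    start = offset / 2.0 / sample_rate                     # frame start time in seconds
    if vad.is_speech(pcm[offset:offset + frame_bytes], sample_rate):
        # extend the current speech region or open a new one
        current = (current[0] if current else start, start + frame_ms / 1000.0)
    elif current:
        segments.append(current)
        current = None
if current:
    segments.append(current)

# list of (start_s, end_s) speech regions; transcripts still need to be re-aligned
print(segments)
```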
> How laborious is it to train my own model? I have hundreds of thousands of transcriptions (by Google STT) with their corresponding .wav files. I was wondering what the simplest way is to build an ASR model with Kaldi (or perhaps build upon the pre-trained ASpIRE model) to improve accuracy? I'm quite unfamiliar with Kaldi.
For traditional Kaldi recipes you will also need a pronunciation dictionary covering all the words in your transcripts. For starters you could use an existing dictionary and generate the missing entries using G2P (e.g. Sequitur).
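As a rough sketch of that G2P step (Sequitur's `g2p.py` is assumed to be installed; `model-6` and the word-list file names are placeholders):

```python
import subprocess

# g2p.py comes with Sequitur G2P; "model-6" is a previously trained G2P model and
# "oov_words.txt" lists one dictionary-missing word per line (placeholder names).
with open("oov_lexicon.txt", "w") as lex:
    subprocess.run(
        ["g2p.py", "--model", "model-6", "--apply", "oov_words.txt"],
        stdout=lex, check=True,
    )
# oov_lexicon.txt now holds word/pronunciation pairs to merge into the base dictionary
```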
If you want to use our tools, check out the README for instructions. If you want to use Kaldi directly, you can use one of their (preferably recent) recipes as a starting point for your own experiments.
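For orientation, here is a hedged sketch of the data preparation most Kaldi recipes start from: a `data/train` directory with `wav.scp`, `text` and `utt2spk`. The `corpus/` layout with one .wav plus one .txt transcript per utterance is an assumption, not something from this thread:

```python
from pathlib import Path

corpus, data = Path("corpus"), Path("data/train")
data.mkdir(parents=True, exist_ok=True)

with open(data / "wav.scp", "w") as wav_scp, \
     open(data / "text", "w") as text, \
     open(data / "utt2spk", "w") as utt2spk:
    for wav in sorted(corpus.glob("*.wav")):
        utt_id = wav.stem
        transcript = (corpus / f"{utt_id}.txt").read_text().strip().lower()
        wav_scp.write(f"{utt_id} {wav.resolve()}\n")     # utterance id -> audio path
        text.write(f"{utt_id} {transcript}\n")           # utterance id -> transcript
        utt2spk.write(f"{utt_id} {utt_id}\n")            # no speaker labels: one "speaker" per utterance

# afterwards, Kaldi's utils/utt2spk_to_spk2utt.pl and utils/fix_data_dir.sh
# produce the remaining files and sanity-check the directory
```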
Other than that, you could also use an end-to-end ASR engine like wav2letter or DeepSpeech.
I have a question regarding the performance (in terms of accuracy) of the Kaldi models (specifically kaldi-generic-en-tdnn_f) versus other well-known engines from Google and Watson when doing speech-to-text. From my preliminary testing (on telephony data), the Kaldi models are not nearly as accurate as Google's or Watson's. My typical audio file is ~30-60 seconds long.
Since I'm not sure how this model was trained, should I chunk the audio file into multiple sentence-length files to improve performance? Has anyone had good luck with Kaldi in terms of accuracy compared to Google and Watson?