Open mabergerx opened 6 years ago
No, state-of-the-art systems use many more algorithmic features. The notebook is just a demo for beginners.
I found that most of the available models on GitHub are still far from Google/Microsoft/Apple's speech-to-text performance. What is missing? Training data or a language model?
There is a plethora of methods to improve speech-to-text performance. Have a look at wer_are_we to find the best-performing methods. A summary of useful methods includes:
to name just a few methods. Feel free to add more ideas.
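One widely used example of such a method, given here purely as an illustration, is SpecAugment-style time/frequency masking of the input spectrogram during training. The function name and mask sizes below are my own assumptions, not anything from t2t:

```python
import numpy as np

def spec_augment(spectrogram, num_freq_masks=2, freq_mask_width=15,
                 num_time_masks=2, time_mask_width=40, rng=None):
    """Illustrative SpecAugment-style masking: zero out random frequency
    bands and time spans of a (time, freq) log-mel spectrogram."""
    rng = rng or np.random.default_rng()
    augmented = spectrogram.copy()
    num_steps, num_bins = augmented.shape

    for _ in range(num_freq_masks):
        width = int(rng.integers(0, freq_mask_width + 1))
        start = int(rng.integers(0, max(1, num_bins - width)))
        augmented[:, start:start + width] = 0.0

    for _ in range(num_time_masks):
        width = int(rng.integers(0, time_mask_width + 1))
        start = int(rng.integers(0, max(1, num_steps - width)))
        augmented[start:start + width, :] = 0.0

    return augmented

# Example: mask a random (time=300, mel=80) feature matrix.
features = np.random.randn(300, 80).astype(np.float32)
masked = spec_augment(features)
```

The masked features are fed to the same training pipeline as before; the random masks simply act as a regularizer.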
How can we apply a language model in the t2t decoding process?
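One common approach is shallow fusion: at every beam-search step, add a weighted language-model log-probability to the decoder's own log-probability. A minimal sketch, assuming hypothetical `asr_log_probs` and `lm_log_probs` scoring callables (this is not the actual t2t decoding API):

```python
import numpy as np

def shallow_fusion_step(asr_log_probs, lm_log_probs, beams,
                        lm_weight=0.3, beam_size=4):
    """One beam-search expansion step with shallow LM fusion.

    asr_log_probs(prefix) -> np.array over the vocabulary (decoder scores)
    lm_log_probs(prefix)  -> np.array over the vocabulary (external LM scores)
    beams: list of (token_prefix, cumulative_score) pairs.
    Both scoring callables are hypothetical stand-ins.
    """
    candidates = []
    for prefix, score in beams:
        # Combine acoustic/decoder evidence with the LM prior.
        combined = asr_log_probs(prefix) + lm_weight * lm_log_probs(prefix)
        # Keep the top extensions of this prefix.
        for token in np.argsort(combined)[-beam_size:]:
            candidates.append((prefix + [int(token)],
                               score + float(combined[token])))
    # Prune the candidate pool back down to the beam size.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]
```

The usual pattern is to start from `beams = [([bos_id], 0.0)]` and repeat this step until every beam ends in an end-of-sequence token.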
Description
We used the "ASR with Transformer" Colab notebook, which lets us load the pre-trained checkpoints of the ASR Problems trained on the LibriSpeech and Common Voice datasets. We tried out a few sentences and the results were not very good, for example for LibriSpeech:
Target: "Hello world" Output: "HALLOW WORLDS"
Target: "To which address can we send the official documents?" Output: "THE WITCH ANNA IS COMING SCENT OFFICIAL LOGAMENTS"
If we compare this to the performance of the Google Cloud Speech-to-Text API (and the service used in Google Translate from voice, which I assume is the same API), that performance is very, very good. In the paper, the architecture used is an encoder-decoder one with attention, just like the Transformer. However, a separate language model is used. Does that make such a huge difference? Or are the checkpoints in the Colab notebook not trained on as much data / for as long as the Speech-to-Text API?
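For illustration, one simple way a separate language model is commonly integrated is to rescore an n-best list of decoder hypotheses with it; the `lm_score` interface below is hypothetical:

```python
def rescore_nbest(nbest, lm_score, asr_weight=1.0, lm_weight=0.5,
                  length_reward=0.1):
    """Re-rank n-best ASR hypotheses with an external language model.

    nbest: list of (transcript, asr_log_prob) pairs from the decoder.
    lm_score(transcript) -> log-probability under the external LM (hypothetical).
    """
    def combined(item):
        transcript, asr_log_prob = item
        # A small length reward keeps the LM from always preferring short outputs.
        return (asr_weight * asr_log_prob
                + lm_weight * lm_score(transcript)
                + length_reward * len(transcript.split()))
    return sorted(nbest, key=combined, reverse=True)
```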
In general, would it be possible to achieve better results than those in the ASR Colab notebook with the Transformer architecture (and for different languages)?
Thanks!