tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

[Question] ASR Transformer performance vs. Google Speech-to-Text #1121

Open mabergerx opened 5 years ago

mabergerx commented 5 years ago


We used the "ASR with Transformer" Colab notebook, which lets us load the pre-trained checkpoints for the ASR problems trained on the LibriSpeech and Common Voice datasets. We tried a few sentences and the results were not very good. For example, with the LibriSpeech checkpoint:

Target: "Hello world" Output: "HALLOW WORLDS"

Target: "To which address can we send the official documents?" Output: "THE WITCH ANNA IS COMING SCENT OFFICIAL LOGAMENTS"
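To put a number on how far off these outputs are, one option is the word error rate (WER): the word-level edit distance between target and output, divided by the number of target words. The snippet below is a minimal sketch, not part of the notebook or of tensor2tensor; the function names and the normalization (lowercasing, dropping punctuation) are assumptions made for illustration.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and keep only word-like tokens (assumed normalization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    pairs = [
        ("Hello world", "HALLOW WORLDS"),
        ("To which address can we send the official documents?",
         "THE WITCH ANNA IS COMING SCENT OFFICIAL LOGAMENTS"),
    ]
    for target, output in pairs:
        print(f"{word_error_rate(target, output):.2f}  {target!r}")
```

With this normalization, both words in the first example count as substitutions, so its WER is 1.0 even though the output sounds close to the target.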

By comparison, the Google Cloud Speech-to-Text API (and the voice input in Google Translate, which I assume uses the same API) performs very, very well. In the paper, the architecture is an encoder-decoder with attention, just like the Transformer. However, a separate language model is used. Does that make such a huge difference? Or were the checkpoints in the Colab notebook trained on less data, or for less time, than the Speech-to-Text API models?

In general, would it be possible to get better results than those in the ASR Colab notebook using the Transformer architecture, including for other languages?


nshmyrev commented 5 years ago

No, state-of-the-art systems use many more algorithmic features. The notebook is just a demo for beginners.

cwlinghk commented 4 years ago

I found that most of the models available on GitHub are still far from the speech-to-text performance of Google, Microsoft, and Apple. What is missing? Training data or a language model?

cantwbr commented 4 years ago

There is a plethora of methods to improve speech-to-text performance. Have a look at wer_are_we to find the best-performing methods. A summary of useful methods includes:

to name just a few methods. Feel free to add more ideas.

cogmeta commented 4 years ago

How can we apply a language model in the t2t decoding process?
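One common way to combine a separate language model with an ASR decoder is shallow fusion, often applied as n-best rescoring: each candidate transcript gets its ASR log-probability plus a weighted language-model log-probability, and the highest combined score wins. The sketch below is illustrative only and does not use any tensor2tensor API; the n-best list, its scores, the toy unigram language model, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def make_unigram_lm(corpus: str) -> Callable[[str], float]:
    """Build a toy add-one-smoothed unigram LM (a stand-in for a real LM)."""
    counts = Counter(corpus.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unknown words

    def log_prob(sentence: str) -> float:
        return sum(
            math.log((counts[word] + 1) / (total + vocab_size))
            for word in sentence.lower().split()
        )

    return log_prob


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    lm_log_prob: Callable[[str], float],
    lm_weight: float = 0.5,
) -> str:
    """Return the hypothesis maximizing asr_score + lm_weight * lm_score."""
    return max(
        nbest,
        key=lambda item: item[1] + lm_weight * lm_log_prob(item[0]),
    )[0]


if __name__ == "__main__":
    # Hypothetical n-best list: (transcript, ASR log-probability).
    nbest = [("hallow worlds", -1.0), ("hello world", -1.2)]
    lm = make_unigram_lm("hello world hello there world")
    print(rescore_nbest(nbest, lm, lm_weight=0.0))  # ASR score only
    print(rescore_nbest(nbest, lm, lm_weight=0.5))  # with LM fusion
```

In practice the fusion is usually applied inside beam search at every decoding step rather than after the fact, and `lm_weight` is tuned on held-out data; rescoring a finished n-best list is just the simplest variant to sketch.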