Open mabergerx opened 6 years ago
No, state-of-the-art systems use many more algorithmic features. The notebook is just a demo for beginners.
I found that most of the available models on GitHub are still far from Google/Microsoft/Apple's speech-to-text performance. What is missing? Training data or a language model?
There is a plethora of methods to improve speech-to-text performance. Have a look at wer_are_we to find the best-performing methods. A summary of useful methods includes:
to name just a few methods. Feel free to add more ideas.
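One widely used example of such a method, given here purely as an illustration, is SpecAugment-style time/frequency masking of the input spectrogram during training. The function name and mask sizes below are my own assumptions, not anything from t2t:

```python
import numpy as np

def spec_augment(spectrogram, num_freq_masks=2, freq_mask_width=15,
                 num_time_masks=2, time_mask_width=40, rng=None):
    """Illustrative SpecAugment-style masking: zero out random frequency
    bands and time spans of a (time, freq) log-mel spectrogram."""
    rng = rng or np.random.default_rng()
    augmented = spectrogram.copy()
    num_steps, num_bins = augmented.shape

    for _ in range(num_freq_masks):
        width = int(rng.integers(0, freq_mask_width + 1))
        start = int(rng.integers(0, max(1, num_bins - width)))
        augmented[:, start:start + width] = 0.0

    for _ in range(num_time_masks):
        width = int(rng.integers(0, time_mask_width + 1))
        start = int(rng.integers(0, max(1, num_steps - width)))
        augmented[start:start + width, :] = 0.0

    return augmented

# Example: mask a random (time=300, mel=80) feature matrix.
features = np.random.randn(300, 80).astype(np.float32)
masked = spec_augment(features)
```

The masked features are fed to the same training pipeline as before; the random masks simply act as a regularizer.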
How can we apply a language model in the t2t decoding process?
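One common approach is shallow fusion: at every beam-search step, add a weighted language-model log-probability to the decoder's own log-probability. A minimal sketch, assuming hypothetical `asr_log_probs` and `lm_log_probs` scoring callables (this is not the actual t2t decoding API):

```python
import numpy as np

def shallow_fusion_step(asr_log_probs, lm_log_probs, beams,
                        lm_weight=0.3, beam_size=4):
    """One beam-search expansion step with shallow LM fusion.

    asr_log_probs(prefix) -> np.array over the vocabulary (decoder scores)
    lm_log_probs(prefix)  -> np.array over the vocabulary (external LM scores)
    beams: list of (token_prefix, cumulative_score) pairs.
    Both scoring callables are hypothetical stand-ins.
    """
    candidates = []
    for prefix, score in beams:
        # Combine acoustic/decoder evidence with the LM prior.
        combined = asr_log_probs(prefix) + lm_weight * lm_log_probs(prefix)
        # Keep the top extensions of this prefix.
        for token in np.argsort(combined)[-beam_size:]:
            candidates.append((prefix + [int(token)],
                               score + float(combined[token])))
    # Prune the candidate pool back down to the beam size.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]
```

The usual pattern is to start from `beams = [([bos_id], 0.0)]` and repeat this step until every beam ends in an end-of-sequence token.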
Description
We used the "ASR with Transformer" Colab notebook, which lets us load the pre-trained checkpoints of the ASR Problems trained on the LibriSpeech and Common Voice datasets. We tried out a few sentences and the results were not very good, for example for LibriSpeech:
Target: "Hello world" Output: "HALLOW WORLDS"
Target: "To which address can we send the official documents?" Output: "THE WITCH ANNA IS COMING SCENT OFFICIAL LOGAMENTS"
If we compare this to the performance of the Google Cloud Speech-to-Text API (and the service used in Google Translate from voice, which I assume is the same API), that performance is very, very good. In the paper, the architecture used is an encoder-decoder one with attention, just like the Transformer. However, a separate language model is used. Does that make such a huge difference? Or are the checkpoints in the Colab notebook not trained on as much data / for as long as the Speech-to-Text API?
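For illustration, one simple way a separate language model is commonly integrated is to rescore an n-best list of decoder hypotheses with it; the `lm_score` interface below is hypothetical:

```python
def rescore_nbest(nbest, lm_score, asr_weight=1.0, lm_weight=0.5,
                  length_reward=0.1):
    """Re-rank n-best ASR hypotheses with an external language model.

    nbest: list of (transcript, asr_log_prob) pairs from the decoder.
    lm_score(transcript) -> log-probability under the external LM (hypothetical).
    """
    def combined(item):
        transcript, asr_log_prob = item
        # A small length reward keeps the LM from always preferring short outputs.
        return (asr_weight * asr_log_prob
                + lm_weight * lm_score(transcript)
                + length_reward * len(transcript.split()))
    return sorted(nbest, key=combined, reverse=True)
```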
In general, would it be possible to achieve better results than those in the ASR Colab notebook with the Transformer architecture (and for different languages)?
Thanks!