w3c-ccg / community

COMMUNITY: W3C Credentials Community Group Community Repo
https://w3c-ccg.github.io/community

Auto-transcription quality leaves much to be desired #227

Closed (msporny closed this issue 2 years ago)

msporny commented 2 years ago

Auto-transcription was enabled in an attempt to help people who run calls take more accurate minutes/transcriptions. Unfortunately, the accuracy of auto-transcription leaves much to be desired when a fair amount of technical jargon is used. The problems with auto-transcription as we've currently implemented it are:

  1. Google Speech-to-Text API captures every utterance that everyone says.
  2. It then gets key words subtly wrong, such that most of the sentences are gibberish.
  3. Then the person cleaning the transcription has to make sense of the AI's drunken ramblings, which takes a lot of time because it has captured every single word that every single person has said.

To address this, we could try to do some of the following:

  1. Turn auto-transcription off by default, so people have to turn it on if they want it.
  2. Enable high-fidelity audio -- Google claims this improves recognition (see the configuration sketch after this list).
  3. Enable training (this is a privacy concern; we can't use it if even a single person on a call does not consent to it).
  4. Ask someone on the call to make substitutions in the chat in real time - this is sort of the halfway point between scribing and using the transcriber. (contributed by Kerri Lemoie)
  5. Add timestamps to the published log so that if the text doesn’t seem right, someone can go to that time in the audio and listen for themselves. (contributed by Kerri Lemoie)
  6. Maybe we could try other speech-to-text options? This thread mentions Vosk (https://alphacephei.com/vosk/) in relation to Jigasi, as well as Mozilla's DeepSpeech (https://deepspeech.readthedocs.io/en/r0.9/): https://community.jitsi.org/t/jigasi-open-source-alternative-of-google-speech-to-text/20739/15. Both Vosk and DeepSpeech use Google's TensorFlow, so they may not be any better. A DeepSpeech setup would have the Jitsi recording fed to it after recording rather than in real time, which may not be what we want either, but I'm putting it out there as an option to consider or to trigger other ideas. (contributed by Kerri Lemoie)
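For item 2, the relevant knob lives in the Google Cloud Speech-to-Text request configuration. The sketch below is a minimal, hypothetical example (not the actual transcriber bot code) showing how the pricier enhanced model and phrase hints for our jargon could be requested; the file name and phrase list are placeholders.

```python
# Hypothetical sketch: request Google's enhanced ("high fidelity") model and
# supply phrase hints for CCG jargon. Not the actual CCG transcriber code.
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    use_enhanced=True,   # opt in to the more expensive enhanced models (item 2)
    model="video",       # enhanced model suited to multi-speaker audio
    enable_automatic_punctuation=True,
    speech_contexts=[
        speech.SpeechContext(
            phrases=["verifiable credential", "decentralized identifier", "DID"]
        )
    ],
)

# "call-audio.wav" is a placeholder; recognize() handles short clips only,
# so a full call recording would go through long_running_recognize() instead.
with open("call-audio.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```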
msporny commented 2 years ago

I did item #2 and Charles did item #4 above. The auto-transcription feature is working much better now that we're using a more financially costly audio AI model. Google has a great business model here -- "Those are some pretty words you just said... it'd be a real shame if something were to happen to 'em, pal." :P

Err, I mean, Google is wonderful and there is nothing wrong with charging good money for a service that provides great value.

The transcriptions seem to be good enough to replace human scribes at this point. The quality is not as good as a /good/ human scribe's, and the bots capture every single word that is said (probably too much), but the days of human transcription seem to be numbered.

I've optimized the transcription bot so it stops recording every single utterance; one- and two-word quips are no longer captured. That was most of the "clean up" work required for auto-transcribed minutes. The remaining cleanup takes less time (at least for me) than dealing w/ a human scribe, and the output from meeting to meeting is far more consistent now.
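As a rough illustration, "stops recording every single utterance" amounts to a length filter on each transcribed segment. A minimal sketch follows; the threshold and function name are hypothetical, not the bot's actual code.

```python
MIN_WORDS = 3  # hypothetical cutoff: drop one- and two-word quips

def keep_segment(transcript: str) -> bool:
    """Return True if a transcribed segment is long enough to publish."""
    return len(transcript.split()) >= MIN_WORDS

segments = ["yes", "sounds good", "I think the issuer should sign the credential"]
minutes = [s for s in segments if keep_segment(s)]
# minutes == ["I think the issuer should sign the credential"]
```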

I believe this issue is now resolved, closing.