jitsi / jigasi

Jigasi: a server-side application acting as a gateway to Jitsi Meet conferences. Currently allows regular SIP clients to join meetings and provides transcription capabilities.
Apache License 2.0

Speech-to-text feature #157

Open woj-i opened 5 years ago

woj-i commented 5 years ago

I know that meeting transcription was a GSoC 2017 project. The outcome is described here https://nikvaessen.github.io/jekyll/update/2017/08/01/speech-to-text-prototype-in-Jitsi_Meet.html and the follow-up announcement for 2018 is here https://jitsi.org/gsoc/s2t/

The main pain point of this solution is that the speech recognition service is a Google service, not an open-source one. This presentation https://archive.fosdem.org/2018/schedule/event/jitsi/ mentioned that Deep Speech is listed as future work for the transcription module.

Nowadays there is a brand-new Java API for the Deep Speech framework, so I think this is a very good opportunity to implement Deep Speech as the speech-to-text engine in Jitsi.

The aim of this issue is to encourage contributors to work on this exciting topic, as well as to understand the current state of meeting transcription in Jitsi. I can see that https://github.com/jitsi/Sphinx4-HTTP-server uses HTTP for communication, but I am not sure whether it was just an experiment or the direction the Jitsi developers want to follow. If it is the latter, there is some work on standardizing a web API for speech-to-text here: https://w3c.github.io/speech-api/webspeechapi.html

nikvaessen commented 5 years ago

Thanks for your interest in the speech-to-text features!

There are currently no plans for integrating DeepSpeech. I tested the pre-trained models Mozilla has released, and they did not produce adequate transcriptions. That was about a year ago, however, so things might have changed. Moreover, I'm not sure how much server power is required to run speech recognition quickly for multiple conferences at once. The cost of this (if self-hosted) might be higher than relying on Google (ignoring privacy; this would need more analysis, and it assumes good transcription results...).

The sphinx-4 library was an attempt at implementing our own solution, but we quickly ran into performance issues. That is why we switched to the Google API, which was the more pragmatic choice. Unfortunately, as far as I'm aware, open-source solutions for speech-to-text are simply not yet good enough for meetings.

There is also currently no interest in using speech-to-text for controlling the user-interface.

P.S. We normally use https://community.jitsi.org/ for discussing development.

nshmyrev commented 4 years ago

We have implemented support for the Vosk speech recognition server here:

https://github.com/jitsi/jigasi/pull/294

I hope the patch can be integrated soon.