jech / galene

The Galène videoconference server
https://galene.org
MIT License

Feature request (Accessibility): Optional Speech-to-Text (STT) integration #202

Open TechnologyClassroom opened 1 month ago

TechnologyClassroom commented 1 month ago

Adding speech-to-text to Galène would greatly help people who have trouble hearing. The Whisper and Vosk models can be self-hosted and run on low-resource machines, which could pair well with Galène. livestream.sh is an example of nearly real-time transcription with Whisper using only the CPU.
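To make the near-real-time idea concrete, here is a minimal sketch of the chunked loop such a transcriber runs. The `transcribe()` stub is a placeholder for an actual Whisper or Vosk call; the chunk size and sample format are illustrative assumptions, not livestream.sh's exact mechanics.

```python
import io

CHUNK_SECONDS = 5          # transcribe in short windows for low latency
SAMPLE_RATE = 16000        # 16 kHz mono PCM, the format most local STT models expect
BYTES_PER_SAMPLE = 2       # signed 16-bit samples
CHUNK_BYTES = CHUNK_SECONDS * SAMPLE_RATE * BYTES_PER_SAMPLE

def transcribe(pcm_chunk: bytes) -> str:
    # Stub: a real implementation would hand this PCM to Whisper or Vosk.
    # Here we just report the chunk's duration.
    seconds = len(pcm_chunk) / (SAMPLE_RATE * BYTES_PER_SAMPLE)
    return f"[{seconds:.1f}s of audio]"

def stream_transcripts(audio: io.BufferedIOBase):
    """Yield one transcript line per fixed-size audio chunk."""
    while True:
        chunk = audio.read(CHUNK_BYTES)
        if not chunk:
            break
        yield transcribe(chunk)

# Example: 12 seconds of silence is split into three windows (5 s, 5 s, 2 s).
silence = io.BytesIO(b"\x00" * (12 * SAMPLE_RATE * BYTES_PER_SAMPLE))
lines = list(stream_transcripts(silence))
```

The latency/accuracy trade-off lives in `CHUNK_SECONDS`: shorter windows appear faster on screen but give the model less context per chunk.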

I recognize that this could be a difficult item, but I do want to put it out there as it could have a great impact.

jech commented 1 month ago

I'm open to the idea, but I'd need to speak with the people who actually need the feature. In particular, I'd need to understand why they don't use a system-wide speech-to-text system.

I have spoken to visually impaired users of Galene, and they tell me that they use a system-wide screen reader and therefore don't need TTS support in Galene itself; they just need the Galene UI to be accessible (which is apparently the case). Before implementing the feature you request, I need to understand whether hearing-impaired users use a system-wide speech-to-text system and, if they don't, why.

If the issue is that there are no good speech-to-text systems for free OSes, then in my opinion we should work on building one, rather than adding speech-to-text support to every single application.

TechnologyClassroom commented 1 month ago

Those are good questions.

The technology exists today for free desktop OSes, but it is still in developer-skill territory rather than user-friendly. The script above could be run in a local terminal on an old laptop and connected to the desktop audio instead of the microphone to get a local live transcription in near real-time. The terminal would need to stay always on top and take up enough screen real estate to be useful. Setting up local Whisper models takes some command-line experience, which not everyone has. There is definitely work that could be done to make this process easier, such as GUIs, packaging, and installers. On the mobile front, things are still at a very early stage, and processing power could be an issue.
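The "desktop audio instead of the microphone" step is where most of the command-line friction lives. As a hedged sketch, this builds the kind of ffmpeg invocation that captures a PulseAudio/PipeWire monitor source and emits raw PCM suitable for piping into a local model; the monitor source name is an example and varies per system.

```python
import shlex

def desktop_capture_cmd(monitor_source: str) -> list[str]:
    """ffmpeg command that captures desktop audio from a PulseAudio
    monitor source and writes 16 kHz mono 16-bit PCM to stdout,
    ready to pipe into a local STT model."""
    return [
        "ffmpeg",
        "-f", "pulse",          # PulseAudio/PipeWire input device
        "-i", monitor_source,   # a "*.monitor" source -- system dependent
        "-ar", "16000",         # resample to 16 kHz
        "-ac", "1",             # downmix to mono
        "-f", "s16le",          # raw signed 16-bit little-endian PCM
        "-",                    # write to stdout for piping
    ]

# Example monitor source name; list yours with `pactl list short sources`.
cmd = desktop_capture_cmd("alsa_output.analog-stereo.monitor")
print(shlex.join(cmd))
```

Wrapping exactly this kind of incantation is what the GUIs, packaging, and installers mentioned above would buy ordinary users.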

If you run an event where attendees may or may not have hearing difficulties and you supply all of the technology yourself, the local Whisper system would need to be configured on every desktop machine, and someone would need to explain how to start it if and when it is needed. Per-machine configuration would scale poorly in this scenario.

Jitsi Meet with Jigasi adds optional transcription, followed by optional translation through LibreTranslate. Transcription would be the first step towards translation.

If the event organizer could get STT working once on the conferencing system, then all users could benefit, whether they need transcription, prefer subtitles, or are not native speakers of the language. The transcripts could be integrated into the chat system or presented in some other intuitive way that does not leave users switching between two windows, juggling window sizes to follow the chat, waiting for a model to download before they can participate, or unable to participate at all on a mobile device.

jech commented 1 month ago

Ah-ha, you're thinking of server-side STT. Yes, that makes more sense.

I think this could be done by writing a separate client that connects to the Galene server, does STT, and publishes the resulting text in the chat. This could run on any computer, which would avoid putting CPU-intensive work on the Galene server.
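The chat-publishing half of such a side-car client might look like the sketch below. The message shapes follow my reading of Galene's protocol documentation (README.PROTOCOL) and may be out of date; treat every field name here as an assumption to be checked against the server version in use. Receiving the audio itself would additionally require a WebRTC stack, which is out of scope for this sketch.

```python
import json
import uuid

# Hypothetical side-car transcriber client for Galene: it would join a
# group over the server's WebSocket, transcribe audio elsewhere, and
# push each transcript line into the group chat.

CLIENT_ID = str(uuid.uuid4())

def handshake_message() -> str:
    # Assumed initial handshake; version string is a guess.
    return json.dumps({"type": "handshake", "version": ["2"], "id": CLIENT_ID})

def join_message(group: str, username: str, password: str) -> str:
    # Assumed join request for a group.
    return json.dumps({
        "type": "join", "kind": "join",
        "group": group, "username": username, "password": password,
    })

def chat_message(text: str, username: str = "transcriber") -> str:
    # Assumed chat message carrying one transcript line.
    return json.dumps({
        "type": "chat", "source": CLIENT_ID,
        "username": username, "value": text,
    })

def publish_transcripts(send, lines):
    """Push each transcript line into the group chat.
    `send` is whatever writes a text frame on the WebSocket
    (for example, the send method of a `websockets` connection)."""
    for line in lines:
        send(chat_message(line))
```

Keeping the STT model in this external client, as suggested above, means the Galene server itself never sees the CPU load: the transcriber can live on any spare machine with network access.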

Please don't hold your breath.