mkiol / dsnote

Speech Note Linux app. Note taking, reading and translating with offline Speech to Text, Text to Speech and Machine translation.
Mozilla Public License 2.0
567 stars 20 forks

Add a "live" mode that can be used to translate voice communications hands-free continuously #159

Open unfa opened 2 months ago

unfa commented 2 months ago

Hi! Thank you for DSnote, it's incredible software and I am very grateful for its existence and continued development!

TL;DR: I am looking for ways to help me understand Russian, Ukrainian and other languages spoken over voice chat while I am playing a multiplayer video game.

I was able to set up DSnote to listen to the game's audio output, translate the voice to English and even provide a spoken translation, albeit with substantial delay (I could live with that).

The problem is that the translation in continuous dictation mode is performed after the user manually stops the capture. In a single-sentence mode the application stops completely after one sentence.

What I'd like is something in-between, or maybe a one-sentence loop mode.

  1. Listen to incoming audio and slice it into short sentences.
  2. Process transcription and translation
  3. Output translated text via TTS
  4. Back to 1.

This would already be a lot. It would be even better if recording could carry on while transcription/translation is being done, so that parts of the voice communication are not lost. I am not using GPU acceleration (I have an RX6800 XT GPU and Ryzen 9 3900X), but I have plenty of CPU cores, so I think my machine could handle "dovetailing" transcription/translation/TTS processing, depending on the models used.

(BTW: I tried using AMD GPU acceleration, but when I install it, DSnote freezes my entire system at startup, so I went back to CPU. That's another topic.)
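The "dovetailing" idea can be sketched as a three-stage pipeline in which each stage runs in its own thread and hands results to the next through a queue, so capture never waits for translation. This is only an illustration under stated assumptions, not how Speech Note works internally: `run_pipeline`, `stage`, and the `transcribe`, `translate` and `speak` callables are hypothetical placeholders for whatever STT, translation and TTS engines are in use.

```python
import queue
import threading

STOP = object()  # sentinel that tells each stage to shut down


def stage(work, inbox, outbox):
    """Apply `work` to every item from `inbox`; forward results to `outbox`."""
    while True:
        item = inbox.get()
        if item is STOP:
            if outbox is not None:
                outbox.put(STOP)  # propagate shutdown to the next stage
            break
        result = work(item)
        if outbox is not None:
            outbox.put(result)


def run_pipeline(segments, transcribe, translate, speak):
    """Feed audio segments through STT -> translation -> TTS concurrently.

    All three callables are hypothetical stand-ins for real engines.
    """
    q_audio, q_text, q_translated = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=stage, args=(transcribe, q_audio, q_text)),
        threading.Thread(target=stage, args=(translate, q_text, q_translated)),
        threading.Thread(target=stage, args=(speak, q_translated, None)),
    ]
    for t in threads:
        t.start()
    for segment in segments:  # capture keeps going while later stages are busy
        q_audio.put(segment)
    q_audio.put(STOP)
    for t in threads:
        t.join()
```

Because every stage is a single thread reading from a FIFO queue, sentences come out in the order they were spoken, even when one stage is slower than the others.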

mkiol commented 1 month ago

Hi. Sorry for the very late reply. I was vacationing ⛱️.

> translation in continuous dictation mode is performed after the user manually stops the capture

I assume you mean the "Translate to English" feature in Whisper models. If so, the translation is done when silence is detected in the audio stream. If in-game speech mixes with other sounds and there are no strict periods of silence, this may not work well.
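To illustrate why silence-based segmentation struggles with game audio, here is a minimal sketch of an energy-threshold splitter. It is an assumption-laden toy, not the detector Speech Note or Whisper actually uses: the function name `split_on_silence`, the `threshold` value and the `min_silence_frames` count are all made up for the example.

```python
def split_on_silence(frames, threshold=500, min_silence_frames=3):
    """Group frames into utterances, cutting at runs of quiet frames.

    `frames` is an iterable of lists of 16-bit PCM sample values.
    A frame counts as quiet when its peak amplitude is below `threshold`
    (a hypothetical value). Quiet frames are dropped from the output.
    """
    utterance, quiet_run = [], 0
    for frame in frames:
        peak = max((abs(sample) for sample in frame), default=0)
        if peak < threshold:
            quiet_run += 1
            # Enough consecutive quiet frames: close the current utterance.
            if utterance and quiet_run >= min_silence_frames:
                yield utterance
                utterance = []
        else:
            quiet_run = 0
            utterance.append(frame)
    if utterance:  # flush whatever is left when the stream ends
        yield utterance
```

If background sound keeps every frame above the threshold, the generator yields nothing until the stream ends, which matches the behaviour described above: no strict silence, no segment boundary.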

Here's what I'm thinking: in the case you describe, it would be best to use the Vosk engine, because it supports live decoding, and it's also pretty decent at Russian. The only missing part is the translation from Russian to English. Speech Note already has a full translator implemented, but it is not bundled with STT. It is actually on my "TO-DO" list to extend "Translate to English" to translate to any language and for all engines (not only Whisper, but also Vosk and others).
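A rough sketch of what live decoding followed by translation could look like, outside of Speech Note. It assumes a recognizer object that behaves like `vosk.KaldiRecognizer` from the Vosk Python bindings (`AcceptWaveform`, `Result` and `FinalResult`, with results returned as JSON strings containing a `"text"` field). The `live_translate` function and the `translate` callable are hypothetical; this is not Speech Note's implementation.

```python
import json


def live_translate(recognizer, chunks, translate):
    """Yield a translation for each utterance the recognizer finalizes.

    `recognizer` is assumed to behave like vosk.KaldiRecognizer.
    `translate` is a hypothetical callable mapping source text to
    target-language text.
    """
    for chunk in chunks:
        # AcceptWaveform returns a truthy value once an utterance is complete.
        if recognizer.AcceptWaveform(chunk):
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                yield translate(text)
    # Flush whatever audio is still buffered when the stream ends.
    text = json.loads(recognizer.FinalResult()).get("text", "")
    if text:
        yield translate(text)
```

With the real library, the recognizer would be built from a downloaded model (for example `KaldiRecognizer(Model(model_path), 16000)`) and `chunks` would be raw 16-bit mono PCM buffers read from the audio source.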


> What I'd like is something in-between, or maybe a one-sentence loop mode.
>
>   1. Listen to incoming audio and slice it into short sentences.
>   2. Process transcription and translation
>   3. Output translated text via TTS
>   4. Back to 1.

So, you would also like to add TTS... A similar thing has already been requested in https://github.com/mkiol/dsnote/issues/119. I like the idea.

Adding to the backlog.