
This repository outlines the procedures and general information for the Speech Translator project.
https://st.kappaflow.dev

[Feature Request] Voice Translated Text (TTS functionality) #6

Open · opened by sharadagg 7 months ago

sharadagg commented 7 months ago

Hi

There are many instances where users wish to hear the original audio in their native language. Would it be possible to enhance the extension to give audio output in the translated language?

One option would be to use speech-to-speech translation, like this: https://replicate.com/cjwbw/seamless_communication

Another would be to use TTS after the text translation is done, using something like Bark or an API key from ElevenLabs: https://replicate.com/suno-ai/bark

kappaflow commented 7 months ago

Hi,

Can you elaborate on the use case for speech-to-speech (text-to-speech) translation?

I’ve already considered adding TTS functionality. I can’t use AI models, because running such infrastructure is very expensive, but I can use the browser API for TTS.

But at the end of the day, I didn’t find a real use case where users would benefit from it. With audio Setup 1, playing TTS would interfere with the audio capture for the translation. It’s possible to avoid this only with a much more complicated audio setup, and even then it doesn't resolve the issue of the delay until the final translation is done...

It could work with audio Setup 2 (audio captured from the mic), but in that case I don’t know of any use cases for end users. And basically you can already do it right now by using the Read aloud feature of Edge: just right-click on the page with the translation and select that option.
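
For reference, here is a minimal sketch of what browser-built-in TTS could look like using the Web Speech API, which is available in Chromium-based browsers such as Edge. The `translatedText` and `targetLang` inputs are assumptions for the example, not part of the extension:

```ts
// Minimal sketch: speak a translated string with the browser's built-in
// speech synthesis (Web Speech API). No AI model or server is involved.

function speakTranslation(translatedText: string, targetLang: string): void {
  const utterance = new SpeechSynthesisUtterance(translatedText);
  utterance.lang = targetLang; // e.g. "hu-HU" or "ro-RO"

  // Prefer an installed voice matching the target language, if any.
  // Note: getVoices() may be empty until the "voiceschanged" event fires.
  const voice = speechSynthesis
    .getVoices()
    .find((v) => v.lang.startsWith(targetLang));
  if (voice) utterance.voice = voice;

  speechSynthesis.speak(utterance); // utterances are queued; cancel() interrupts
}

// Example: speakTranslation("Szia, világ!", "hu-HU");
```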

sharadagg commented 7 months ago

Thanks for writing back. I really appreciate the effort being put into this extension.

My current specific use case is people who are taking live online courses that are currently only available in English. Their native language is Hungarian or Romanian, and they need to hear the courses in their native language. Quite a few of them struggle to read text (some due to dyslexia, others simply prefer audio to text).

For them the idea would be to use Setup 1: browser audio is routed via a virtual cable as mic input to the extension, and the translated audio output from the extension is heard on the laptop speakers/headset. So the end user basically hears none of the original audio, only the translated audio.

I can see a lot of people being able to access YouTube videos or live streams as translated streaming audio in their languages. For example, my parents would prefer to hear YouTube content in their local language instead of reading it. Currently I do not know of any service offering near-realtime translated audio output (a 3-5 second delay is fine).

Services like Google Translate, Azure, or even Whisper are able to translate streaming audio input (https://github.com/ufal/whisper_streaming). Producing the translated audio is the missing step. A TTS-based approach may introduce latency but could still give an acceptable solution; the alternative would be to try an S2ST (speech-to-speech translation) approach.
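
As an illustration only (not an existing feature), that missing step could look roughly like this: finalized translated text chunks are fed into browser TTS one after another. The `translatedChunks` source below is a stand-in for whatever streaming translation pipeline is used, not an API of the extension:

```ts
// Hypothetical sketch: voice translated text chunks as they become final.
// The chunk source could be the output of a whisper_streaming-style
// pipeline; here it is just an AsyncIterable of strings.

async function voiceTranslatedStream(
  translatedChunks: AsyncIterable<string>,
  targetLang: string,
): Promise<void> {
  for await (const chunk of translatedChunks) {
    await new Promise<void>((resolve, reject) => {
      const u = new SpeechSynthesisUtterance(chunk);
      u.lang = targetLang;
      u.onend = () => resolve();      // wait so chunks stay in order
      u.onerror = (e) => reject(e);
      speechSynthesis.speak(u);
    });
  }
}
```

Each utterance finishes before the next one starts, so the 3-5 second delay budget has to cover both the translation and the speaking time.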

kappaflow commented 7 months ago

Hmm, yes, you can use a different device for the input and the output, but you will be missing every other sound that is not a voice... I didn't know people were actually interested in such an application of the extension.
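
Just to illustrate the "different device for the input and the output" idea, here is a rough sketch using the standard media device APIs. The device labels are only examples, and as far as I know `speechSynthesis` output always goes to the system default device, so this kind of routing only helps when the translated audio is played through a regular audio element:

```ts
// Illustrative sketch: capture audio arriving on a virtual-cable input
// while playing translated audio on a different physical output device.
// Device labels below are examples; they depend on the user's setup.

async function pickDevices() {
  // Labels are only populated after microphone permission has been granted.
  const devices = await navigator.mediaDevices.enumerateDevices();

  const cableInput = devices.find(
    (d) => d.kind === "audioinput" && d.label.includes("CABLE Output"),
  );
  const speakers = devices.find(
    (d) => d.kind === "audiooutput" && d.label.includes("Speakers"),
  );

  // Capture the course audio routed through the virtual cable.
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: cableInput ? { deviceId: { exact: cableInput.deviceId } } : true,
  });

  // Play any translated audio (as a regular media element) on the speakers.
  const player = new Audio();
  if (speakers && "setSinkId" in player) {
    await player.setSinkId(speakers.deviceId);
  }

  return { stream, player };
}
```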

To learn more about the delay issue (why it's difficult to avoid any delay when voicing the translation), you can check this article: https://blog.research.google/2023/08/modeling-and-improving-text-stability.html Basically, the final version of the text can differ from the one shown initially (the interim text), and the changes may occur at the beginning of the sentence as well. It also depends on the language... Such changes happen even with the transcribed text, and it gets even worse after translation. So it's better to read the final translation, but that causes a delay. This is how the delay can be adjusted:

The interim transcription and translation text becomes final chunk by chunk. For the best results, the speech should have a small pause after each phrase or sentence, so the pause gets auto-detected and no words are skipped in between.

But sometimes there are no pauses in the speech for a long time, so to keep the chunks a predictable size, interim text gets forcibly converted to final. In this case, some words may be skipped. The Max number of letters per translation setting (in the Expert Options) defines how many transcribed letters the interim text has to reach before it is forcibly converted to final.

Please keep in mind that some languages have a higher words-per-letter rate. This happens when a single character represents a syllable or an entire word. So you should account for the original language's words-per-letter rate when you set the Max number of letters per translation value.
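
Purely to illustrate the behaviour described in the two paragraphs above (this is not the extension's actual code), the forced interim-to-final conversion could be modelled like this, with `maxLettersPerTranslation` standing in for the Expert Options value:

```ts
// Illustrative model of the interim-to-final chunking described above.
// Names are made up for the example.

interface ChunkerOptions {
  maxLettersPerTranslation: number;      // Expert Options value
  onFinalChunk: (text: string) => void;  // e.g. send to translation / TTS
}

function createChunker({ maxLettersPerTranslation, onFinalChunk }: ChunkerOptions) {
  let interim = "";

  const finalize = () => {
    if (interim.trim()) onFinalChunk(interim.trim());
    interim = "";
  };

  return {
    // Called whenever the recognizer updates the interim transcript.
    updateInterim(text: string) {
      interim = text;
      // No pause for a long time: force the chunk to final once it reaches
      // the configured letter count (words at the boundary may be skipped).
      if (interim.length >= maxLettersPerTranslation) finalize();
    },
    // Called when the recognizer detects a pause (end of phrase/sentence).
    finalize,
  };
}
```

For languages where a single character can stand for a whole syllable or word, the same letter count allows far more words per chunk, which is why the value should be tuned to the original language.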

But even with a small chunk size, there will be a quite noticeable delay before the translation can be voiced. The best scenario for video is using existing subtitles and voicing them. I think there are extensions that can translate and read aloud YouTube subtitles, for example.