odilia-app / odilia

A fast screenreader for the *nix desktop.
https://odilia.app
GNU General Public License v3.0

VoiceOver-Style Language Switching #21

Open TTWNO opened 2 years ago

TTWNO commented 2 years ago

VoiceOver allows you to switch languages mid-string as well as keep settings unique to each language's voice. For example, if I read the following sentence in VoiceOver: "In Chinese, 你好 means hello!"

VoiceOver will automatically switch between voices and settings to describe the entire sentence in one smooth motion. This works even on the tiny CPU of an iPhone SE. Side note: it's not quite that simple; I think the block of foreign text must be a bit longer before it switches voices, but there is definitely a threshold, and it can switch mid-sentence.

Odilia should eventually have this ability. Obviously, if no voices are set up for a language through speech-dispatcher, we may need to fall back to eSpeak, but it's still a way to read and write multi-language documents without having to switch voices back and forth manually.

Language identification, unless I'm completely wrong, is very likely a fairly complex process, relatively speaking, so it should always be optional and exposed as a setting the user can change.

I believe this should be possible using the SSIP protocol with speech-dispatcher. I haven't looked deeply enough to figure this out myself, but I suspect that if it isn't possible that way, I'm not sure how else it could be done. More research required.
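For example, something along these lines might work, although I haven't verified the details; this shows client-to-server commands only, with server responses omitted, and the command names are taken from the SSIP documentation as I understand it:

```
SET self CLIENT_NAME user:odilia:main
SET self LANGUAGE en
SPEAK
In Chinese,
.
SET self LANGUAGE zh
SPEAK
你好
.
SET self LANGUAGE en
SPEAK
means hello!
.
```

Each SPEAK is its own message, though, so I don't know how smooth the transitions between them would be.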

albertotirla commented 2 years ago

There are a few ways to deal with this; two come to mind right now:

  • The most straightforward one is to just rely on text attributes, especially on the web. Correct me if I'm wrong, since I don't have much of a web background, but every HTML page can be marked as being in a specific language with the lang attribute, for example lang="en-US". In that case, wouldn't it be logical for specific paragraphs, or pieces of text inside a paragraph, to be annotated with language tags in the same way? As an aside, I think that's how Wikipedia does it.
  • We can use something like lingua-rs, which doesn't rely on HTML; instead it actually detects the language of the text itself. This requires more processing power and would have to be put behind a user-configurable flag; it isn't meant to be in constant use because of its not-small-at-all memory and CPU consumption. I believe it uses machine learning or something close to it, so the resource usage is expected.

Also, what you saw in VoiceOver is voice-specific, not VoiceOver-specific; some of that can be done in eSpeak as well. What's happening there is that the voice itself knows when a transition from a Latin to a non-Latin alphabet is happening, so it does its own language selection when that text is given to it. I can't write Chinese or Japanese examples because eSpeak doesn't support those, so I will do something similar with Ukrainian. DeepL Translate says "ласкаво просимо до оділії скринрідера!" means "welcome to the Odilia screenreader!" If you read that with eSpeak, you will hear it change voice and language to pronounce it as well as it can, given your locale, codepage and so on. Even though that's not VoiceOver, NVDA, or Orca specific, we can potentially make it Odilia-specific, as long as the speech-dispatcher module currently in use supports the detected language; detection can be wrong sometimes, but that's better than nothing.

Also, we have the problem that speech-dispatcher doesn't allow us to change language mid-sentence. However, we could do language processing before feeding the text to speech-dispatcher: in the language-processing phase we insert speech markers wherever the language changes, if we can determine that accurately, and when the callback fires with a "marker reached" event, we know to change language. We could probably track which language to change to with some kind of mapping from text position, marker name, or whatever that marker event contains, to a language. Yes, this may delay speech quite a bit, I'm not sure, but it's a plan of action if nothing else comes to mind by the time that feature gets implemented.
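To make the marker idea a bit more concrete, here's a rough, hypothetical sketch of the pre-processing step. It assumes the language runs have already been detected somehow, and that the marks would come back to us through speech-dispatcher's index-mark events; names like `Run` and `insert_language_marks` are made up for illustration:

```rust
use std::collections::HashMap;

/// One run of text, together with the language it was detected or annotated as.
struct Run {
    lang: &'static str,
    text: &'static str,
}

/// Insert an index mark before every language change, and remember which
/// language each mark should switch to. When speech-dispatcher later reports
/// that a mark was reached, we look its name up here and send the
/// corresponding SET self LANGUAGE command.
fn insert_language_marks(runs: &[Run]) -> (String, HashMap<String, String>) {
    let mut marked_text = String::new();
    let mut mark_to_lang = HashMap::new();
    for (i, run) in runs.iter().enumerate() {
        let mark = format!("lang-{i}");
        // Real code would escape the text for SSML; skipped here for brevity.
        marked_text.push_str(&format!("<mark name=\"{mark}\"/>{}", run.text));
        mark_to_lang.insert(mark, run.lang.to_string());
    }
    (marked_text, mark_to_lang)
}

fn main() {
    let runs = [
        Run { lang: "en", text: "In Chinese, " },
        Run { lang: "zh", text: "你好" },
        Run { lang: "en", text: " means hello!" },
    ];
    let (marked_text, mark_to_lang) = insert_language_marks(&runs);
    println!("{marked_text}");
    println!("{mark_to_lang:?}");
}
```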

mcb2003 commented 2 years ago
> • The most straightforward one is to just rely on text attributes, especially on the web. Correct me if I'm wrong, since I don't have much of a web background, but every HTML page can be marked as being in a specific language with the lang attribute, for example lang="en-US". In that case, wouldn't it be logical for specific paragraphs, or pieces of text inside a paragraph, to be annotated with language tags in the same way? As an aside, I think that's how Wikipedia does it.

From my understanding this is correct, yes.

> • We can use something like lingua-rs, which doesn't rely on HTML; instead it actually detects the language of the text itself. This requires more processing power and would have to be put behind a user-configurable flag; it isn't meant to be in constant use because of its not-small-at-all memory and CPU consumption. I believe it uses machine learning or something close to it, so the resource usage is expected.

Will look into this more, but cool.
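For reference, lingua's detection API looks pretty small; something like this, going by its README (the exact language set, thresholds and versions would be up to us):

```rust
// Assumes the `lingua` crate (lingua-rs) as a dependency.
use lingua::{Language, LanguageDetector, LanguageDetectorBuilder};

fn main() {
    // Restricting the detector to a handful of languages keeps memory use down;
    // loading every supported language is what makes it really expensive.
    let languages = vec![Language::English, Language::German, Language::Ukrainian];
    let detector: LanguageDetector =
        LanguageDetectorBuilder::from_languages(&languages).build();

    let detected: Option<Language> =
        detector.detect_language_of("ласкаво просимо до оділії скринрідера");
    println!("{detected:?}"); // Some(Ukrainian), if detection succeeds
}
```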

> Also, we have the problem that speech-dispatcher doesn't allow us to change language mid-sentence. However, we could do language processing before feeding the text to speech-dispatcher: in the language-processing phase we insert speech markers wherever the language changes, if we can determine that accurately, and when the callback fires with a "marker reached" event, we know to change language. We could probably track which language to change to with some kind of mapping from text position, marker name, or whatever that marker event contains, to a language. Yes, this may delay speech quite a bit, I'm not sure, but it's a plan of action if nothing else comes to mind by the time that feature gets implemented.

A much simpler solution would be to use SSIP blocks.
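Roughly, on the wire, it might look something like the following, assuming SET self LANGUAGE is among the commands permitted inside a block (that needs checking against the SSIP documentation); server responses are again omitted:

```
BLOCK BEGIN
SET self LANGUAGE en
SPEAK
In Chinese,
.
SET self LANGUAGE zh
SPEAK
你好
.
SET self LANGUAGE en
SPEAK
means hello!
.
BLOCK END
```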

TheQuinbox commented 2 years ago

Unicode character ranges can also be used for most languages with Latin alphabets, for what it's worth. It might also be worth looking into how NVDA on Windows does this.

albertotirla commented 2 years ago

> Unicode character ranges can also be used for most languages with Latin alphabets, for what it's worth. It might also be worth looking into how NVDA on Windows does this.

NVDA doesn't do a very good job of it either, as far as I know; I'm not speaking from a coding/implementation viewpoint here, rather from a user one. Most of the language processing in NVDA is either handled by the synthesizer currently speaking, or by NVDA itself, but as far as I know NVDA only switches language when UIA or whatever changes the language attribute of the currently read text to something other than the language of the current voice, for example when a paragraph is annotated with the language attribute.

About using character ranges: that's probably one of the tricks lingua-rs uses as well, but on its own it doesn't guarantee any reliability whatsoever. For example, just try distinguishing, based on that method, German text from an English translation. We know that German has ü, ö, ä and ß, but once we've identified those, what do we do? Consider the whole lexical unit German, or try to identify, with a German dictionary, the smallest part of that unit that is German and speak that? What can even be considered a lexical unit, and how do we do this? Do we build a synthesizer-level engine and shove that into Odilia? Or maybe I'm misunderstanding what you mean, in which case please post back with an example or a wider explanation, since all of this will be taken into account when we arrive at that feature set and have to revisit this issue in order to implement it.

TheQuinbox commented 2 years ago

Look at www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml for a list of different languages and their Unicode character ranges. It wouldn't work for all languages, though.
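Something like this, just to illustrate the idea (the ranges are well-known Unicode blocks); note that it only distinguishes scripts, not languages that share a script:

```rust
/// A very rough guess at the script of a single character, based on a few
/// well-known Unicode block ranges. This distinguishes scripts, not languages:
/// it cannot tell English from German, or Japanese kanji from Chinese hanzi.
fn script_of(c: char) -> &'static str {
    match c as u32 {
        0x0400..=0x04FF => "Cyrillic",
        0x3040..=0x309F => "Hiragana",
        0x30A0..=0x30FF => "Katakana",
        0x4E00..=0x9FFF => "CJK ideographs",
        _ => "Latin or other",
    }
}

fn main() {
    for word in ["hello", "你好", "ласкаво", "こんにちは"] {
        // Classify each word by its first character; real code would look at
        // whole runs of text, not single characters.
        let first = word.chars().next().unwrap();
        println!("{word}: {}", script_of(first));
    }
}
```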

albertotirla commented 2 years ago

I wanted to reply to your comment via email, but I guess GitHub doesn't want me to do that, so I'll have to post in this field again. Thanks for that link, it will be very useful, even though I personally don't understand much of it, since it's not an actual HTML table and it's kind of confusing. Yes, I see what you mean now; however, those character ranges are pretty much all non-Latin alphabets, i.e. hiragana and katakana, so that method won't help us separate, say, English from German, plus a synthesizer with that capability can already recognise such languages on its own.