Automatic language detection based on unicode ranges

nvaccessAuto commented 11 years ago

Reported by ragb on 2013-02-13 12:26

This is kind of a spin-off of #279.

As settled some time ago, this proposal aims to implement automatic text “language” detection for NVDA. The main goal of this feature is to let users read text in different languages (or, better said, language families) using proper synthesizer voices. By using Unicode character ranges, one can determine at least the language family of a piece of text: Latin-based (English, German, Portuguese, Spanish, French, …), Cyrillic (Russian, Ukrainian, …), kanji (Japanese, maybe Korean; I think that was already written up, but it is too much for my memory), Greek, Arabic (Arabic, Farsi), and others.

In broad terms, implementing this feature in NVDA requires adding a detection module to the speech subsystem that intercepts speech commands and adds “fake” language commands telling the synth to change language, based on changes in the text’s characters. An interface is also needed for the user to tell NVDA which particular language to choose for each language family, that is, what to assume for Latin-based characters, what to assume for Arabic-based characters, etc.

I’ve implemented a prototype of this feature in a custom Vocalizer driver, with no interface to choose the “proper” language. Preliminary testing with Arabic users, using Arabic and English Vocalizer voices, has shown good results, that is, people like the idea. The language detection code was adapted from the Guess_language module, removing some of the detection code which was not applicable (trigram detection for differentiating Latin languages, for instance).

I’ll explain the decision to use, for now, only Unicode-based language detection. Language detection could also be done using trigrams (see here for instance), dictionaries, or other heuristics of that kind. However, the text passed to the synthesizer each time is very small (a line of text, a menu name, etc.), which makes these processes, which are probabilistic by nature, very error-prone. In my testing, applying trigram detection for Latin languages in NVDA proved completely unusable, besides adding a noticeable delay when speaking. For bigger text content (books, articles, etc.) it seems to work well; however, I don’t know whether this can be applied somehow in the future, say by analyzing virtual buffers or something similar.

Regarding punctuation, digits, and other general characters, I’m defaulting to the current language (and voice) of the synth.
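
The approach described above can be illustrated with a minimal Python sketch. It is not the prototype's code: `LangChange`, `script_family` and `add_language_commands` are hypothetical names, the script lookup uses Unicode character names as a shortcut instead of real range tables, and digits and punctuation simply stay in whatever language is already active.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class LangChange:
    """Hypothetical stand-in for a synthesizer language-change command."""
    lang: str


# Illustrative only: maps the first word of a Unicode character name to a
# script family. A real module would use proper Unicode range tables.
NAME_PREFIX_TO_FAMILY = {
    "LATIN": "latin",
    "CYRILLIC": "cyrillic",
    "GREEK": "greek",
    "ARABIC": "arabic",
    "CJK": "han",
}


def script_family(ch):
    """Return a script family for letters, or None for digits, punctuation, spaces."""
    if not ch.isalpha():
        return None
    return NAME_PREFIX_TO_FAMILY.get(unicodedata.name(ch, "").split(" ")[0])


def add_language_commands(text, family_to_lang, current_lang):
    """Split text into chunks, inserting LangChange where the language changes."""
    sequence, chunk, active = [], "", current_lang
    for ch in text:
        # Characters without a known family keep the language already in use.
        lang = family_to_lang.get(script_family(ch), active)
        if lang != active:
            if chunk:
                sequence.append(chunk)
                chunk = ""
            sequence.append(LangChange(lang))
            active = lang
        chunk += ch
    if chunk:
        sequence.append(chunk)
    return sequence
```

For example, with Latin mapped to `en` and Cyrillic mapped to `ru`, the text "Hello мир!" would be split into "Hello ", a change to `ru`, and "мир!".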

I’ll create a branch with my detection module integrated within NVDA, with no interface.

Regarding the interface for selecting which language to assume for each language group (where applicable; Greek, for instance, is only itself), I see a dialog with several combo boxes, one per language family, to choose the language to be used. I think restricting the language choices to the languages available in the current synth may improve usability. I don’t know where to put that dialog, or what to call it (“language detection options”?).
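
As a rough sketch of how the combo box contents could be restricted to what the current synth offers, something like the following might work. The family table and the function name `choices_for_synth` are assumptions made for illustration, not part of any existing NVDA API.

```python
# Hypothetical table of candidate languages per script family.
FAMILY_CANDIDATES = {
    "latin": ["en", "de", "pt", "es", "fr"],
    "cyrillic": ["ru", "uk"],
    "arabic": ["ar", "fa"],
    "greek": ["el"],
}


def choices_for_synth(synth_languages, family_candidates=FAMILY_CANDIDATES):
    """Return, per family, only the candidate languages the synth can speak."""
    # Normalise codes such as "en_US" or "en-US" to their base language.
    available = {
        code.replace("-", "_").split("_")[0].lower() for code in synth_languages
    }
    choices = {}
    for family, candidates in family_candidates.items():
        usable = [lang for lang in candidates if lang in available]
        if usable:
            # Families with no usable voice get no combo box at all.
            choices[family] = usable
    return choices
```

With a synth that reports `en_US`, `ar_SA` and `pt_PT`, only the Latin and Arabic families would be shown, offering English or Portuguese for the former and Arabic for the latter.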

Any questions please ask.

Regards,

Rui Batista

Blocked by #5427, #5438

nishimotz commented 7 years ago

@dineshkaushal updated my pull request regarding number characters.

Without the fix, cases such as the following cause a problem:

1個
(one item in English)

In this case, number characters should be treated as Japanese text. Otherwise, NVDA says "one" in English, then "ko" (the reading of the ideographic character) in Japanese. It is so stupid. The Japanese TTS handles the whole text and gives the correct reading "ikko."

mohdshara commented 7 years ago

I need help with this: if I run

    git clone --recursive https://github.com/nvda-india/nvda/tree/in-t2990-review

I get:

    fatal: repository 'https://github.com/nvda-india/nvda/tree/in-t2990-review/' not found

Cloning the whole nvda-india repository works; however, it doesn't include this tree. I am sure git experts can tell.

nishimotz commented 7 years ago

git clone --recursive -b in-t2990-review https://github.com/nvda-india/nvda

mohdshara commented 7 years ago

@nishimotz thanks a lot. That worked. It works beautifully with Windows OneCore voices. Is there a way to choose which voice speaks a language if there's more than one such voice in that synth?

jcsteh commented 7 years ago

No; you can't choose the specific voice. This choice is made by the synth. Supporting this will be possible using the same technique we will use to support synth switching.

dineshkaushal commented 7 years ago

Thanks @nishimotz for the fix.

But I thought this scenario should be covered by the common Unicode category? My understanding is that the algorithm does not work for your example of 1個 because the number comes before the character 個. Can you verify whether a number coming after the Japanese character works fine?

In that case, instead of adding “Number” as a separate category, we could change the processing so that the language code is applied to a preceding common-category string if there is no earlier language code. This should solve the above scenario. The current implementation should take the default language code for the common category, so this problem should not occur.
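
The backtracking idea in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the names `char_language` and `detect_runs` are hypothetical, Latin letters are hard-wired to `en` and CJK/kana to `ja`, and the real detectLanguage() code in the pull request may be organised quite differently.

```python
import unicodedata
from itertools import groupby

# Assumed mapping for this example only.
NAME_PREFIX_TO_LANG = {"LATIN": "en", "CJK": "ja", "HIRAGANA": "ja", "KATAKANA": "ja"}


def char_language(ch):
    """Language for a letter, or None for common characters (digits, spaces...)."""
    if not ch.isalpha():
        return None
    return NAME_PREFIX_TO_LANG.get(unicodedata.name(ch, "").split(" ")[0])


def detect_runs(text, default_lang):
    langs = [char_language(ch) for ch in text]
    # Forward pass: common characters inherit the preceding language.
    previous = None
    for i, lang in enumerate(langs):
        if lang is None:
            langs[i] = previous
        else:
            previous = lang
    # Backtracking: leading common characters (still None) take the first
    # language found later in the text, or the default if there is none.
    first_known = next((lang for lang in langs if lang is not None), default_lang)
    langs = [first_known if lang is None else lang for lang in langs]
    # Merge consecutive characters with the same language into runs.
    return [
        (lang, "".join(ch for _, ch in group))
        for lang, group in groupby(zip(langs, text), key=lambda pair: pair[0])
    ]
```

Under these assumptions, "1個" and "個1" both come out as a single Japanese run, while a bare "2016" falls back to the default language.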

Could you also give me a log so that I can check what default language is shown for the above example with a Japanese synthesizer?

Thanks for other improvements as well, the code is looking better.

nishimotz commented 7 years ago

The original code treats numbers as the Common category. Because detectScript() ignores the Common category, the language code of digits will be the same as that of the preceding characters. For example, even if Japanese has higher priority, "Excel 2016" is spoken in English to the end. This is difficult for Japanese language users to understand.

My modification treats digit numbers, for all languages, as their native script, so the preferred language priority is respected. For example, if Japanese has higher priority, "Excel" is spoken in English and "2016" is in Japanese. This is much easier to understand.
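
A small sketch may make the difference concrete. It is not the code from the pull request: `classify` and `detect_runs` are hypothetical names, Latin is hard-wired to `en` and CJK to `ja`, and falling back to the default language when no preferred language is configured is an assumption added here.

```python
import unicodedata
from itertools import groupby

# Assumed mapping for this example only.
NAME_PREFIX_TO_LANG = {"LATIN": "en", "CJK": "ja"}


def classify(ch, number_lang):
    """Digits get the number language; letters get their script's language."""
    if ch.isdigit():
        return number_lang
    if ch.isalpha():
        return NAME_PREFIX_TO_LANG.get(unicodedata.name(ch, "").split(" ")[0])
    return None  # spaces and punctuation follow the preceding run


def detect_runs(text, preferred_languages, default_lang):
    # Digits follow the highest-priority preferred language, if any (assumption:
    # otherwise they use the default language).
    number_lang = preferred_languages[0] if preferred_languages else default_lang
    active = default_lang
    langs = []
    for ch in text:
        active = classify(ch, number_lang) or active
        langs.append(active)
    return [
        (lang, "".join(c for _, c in group))
        for lang, group in groupby(zip(langs, text), key=lambda pair: pair[0])
    ]
```

With `["ja"]` as the preferred languages, "Excel 2016" is split into an English run "Excel " and a Japanese run "2016"; with an empty preference list the whole string stays in the default language.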

dineshkaushal commented 7 years ago

I have merged modifications proposed by @nishimotz and added a few unit tests for language detection.

Based on these unit tests, I found that language detection didn’t work properly for numbers if there is no preferred language added.

I have made some corrections so that if there is no preferred language then the default language is used for numbers. The default language is the language reported by the synthesizer. I have also renamed a parameter to make it read better.

@nishimotz could you test the modifications and add more tests, especially for Japanese and Chinese?

Thanks

nishimotz commented 7 years ago

Use of default language sounds good, however, I found an issue with your new revision.

setup:

Windows 10 Japanese (English available as additional language)
NVDA General settings > language : en (English)
NVDA preferred language : empty
NVDA Synthesizer : OneCore voice

procedure:

open NVDA menu > Preferences
move to "Windows 10 OCR"
expected : English voice "Windows ten o c r"
actual : English voice "Windows", Japanese "juu (ten in Japanese)", English "o c r"

nishimotz commented 7 years ago

Tests are working as expected.

The second parameter of detectLanguage() is given in speech.py. The locale value is used as default language of language detector.

However, if automatic language detection is enabled in the NVDA voice settings, the locale value is set to the synthesizer's default language. If Microsoft David is selected, locale is set to 'en_us'. If Microsoft Ichiro is selected in the voice settings, locale is set to 'ja_jp', even if the NVDA general setting is set to English. As a result, even when the NVDA language is English, numbers are spoken in Japanese.

Am I correct? Is that the expected behavior?

nishimotz commented 7 years ago

I have learned more about your code. I am still not sure how the voice language (aka default language) and the preferences should be used. For example, this test, written by me, fails. It is because the second parameter of detectLanguage has higher priority than the preferred languages, so Number always respects the voice language. Is it relevant or not?

    def test_case1(self):
        combinedText = u"Windows 10 OCR"
        config.conf["languageDetection"]["preferredLanguages"] = ("ja",)
        languageDetection.updateLanguagePriorityFromConfig()
        detectedLanguageSequence = languageDetection.detectLanguage(combinedText, "en_US")
        self.compareSpeechSequence(detectedLanguageSequence, [
            LangChangeCommand("en"),
            u"Windows ",
            LangChangeCommand("ja"),
            u"10 ",
            LangChangeCommand("en"),
            u"OCR"
        ])
        config.conf["languageDetection"]["preferredLanguages"] = ()
        languageDetection.updateLanguagePriorityFromConfig()

dineshkaushal commented 7 years ago

As per original design, the second parameter i.e. defaultLanguage of detectLanguage was used to decide whether we would add languageChange command or not. So if defaultLanguage is English and if text string is in Latin and preferred language is English, then there would be no languageChange command as it is added before calling this function.

The purpose of preferred language was to choose a language from list of languages that have the same script.

I thought the common script property would take care of numbers and punctuation. The common script seems to be working for punctuation, but for numbers I am not very sure. There could be the following scenarios:

If the language before and after the numbers is the same, then we could default to that language, and the common property does that very well.

If a language is followed by a number, then we could speak the number in that language, but you suggested that for Excel 2016 you want the numbers to be spoken in Japanese even though the text is in English. For that we need a way to determine which language we should use for numbers.

If a number is followed by a language, we could solve that as well with the common property along with backtracking.

If a number stands alone, then we don’t know what to do, so we either speak it with the previous language or with the NVDA language selection.
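
The four scenarios can be written down as a small resolver, shown here only as a sketch. The function name `resolve_number_runs` and the run format (a list of `(language, text)` pairs where `None` marks a number-only run) are assumptions for illustration. The open question about "Excel 2016" is not settled by it: a number after a language simply takes that language.

```python
def resolve_number_runs(runs, fallback_lang):
    """Assign a language to every number-only run (marked with None)."""
    resolved = []
    for i, (lang, text) in enumerate(runs):
        if lang is None:
            # Nearest language before and after this run, if any.
            before = next((l for l, _ in reversed(runs[:i]) if l is not None), None)
            after = next((l for l, _ in runs[i + 1:] if l is not None), None)
            # Scenarios 1 and 2: a preceding language wins.
            # Scenario 3: nothing precedes, so backtrack from what follows.
            # Scenario 4: a stand-alone number uses the fallback language.
            lang = before or after or fallback_lang
        resolved.append((lang, text))
    return resolved
```

Whether the fallback should be the previous language, the synthesizer language or the NVDA language is left to the caller, since the thread had not decided that point.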

nishimotz commented 7 years ago

Thank you for clarifications regarding preferences.

I made a new pull request which only adds tests regarding Japanese.

dineshkaushal commented 7 years ago

@nishimotz I have included the test cases. Should I assume that these test cases are what Japanese users expect from NVDA language detection? Or do you propose any change regarding how we handle the numbers?

As per your previous comment, “windows 10 OCR” should be read in English.

So do you propose that we should go by either synthesizer language or Language selected in NVDA?

I also request others for their suggestion about this issue.

nishimotz commented 7 years ago

So far, Japanese language users can accept the behavior of current implementation, I think.

mohdshara commented 7 years ago

Could you summarize what work needs to be done before you consider sending this as a PR to be reviewed? For Arabic this works as expected, and it seems this is true for Japanese too.

dineshkaushal commented 6 years ago

Are we going to get this in 2017.4?

zstanecic commented 6 years ago

i am afraid, no

@josephsl,

@mdcurran

dineshkaushal commented 6 years ago

I don’t understand why. I submitted it almost a month and a half ago, with unit tests.

zstanecic commented 6 years ago

Because we now have an RC.

But wait for Mick’s statement.

feerrenrut commented 6 years ago

Yes, it's now too late for this change to go into 2017.4. This is perhaps best anyway, the associated PR ( #7629 ) is a large change, which will take some time to review and given the nature of the change, it will be good for many people to use it via master and next builds before it goes into a release

dineshkaushal commented 6 years ago

OK, I will wait for comments after the review.

Adriani90 commented 5 years ago

@dineshkaushal are you still considering continuing your work on this? It would be highly appreciated. Since a lot of work has been put into that PR, it would be really too bad if it were discontinued. Now that NVDA has been migrated to Python 3, I guess the PR is not compatible anymore.

Adriani90 commented 1 year ago

cc: @mltony

ruifontes commented 1 year ago

I think we should implement this...

mltony commented 1 year ago

I already implemented this feature in Tony's enhancements add-on. However for NVDA core I would argue that we can take this idea a step further and make use of a language detection library in order to distinguish languages properly, e.g. distinguishing English from German, which is not possible with just Unicode character analysis. VoiceOver can already do this. My cursory googling revealed multiple options available: https://towardsdatascience.com/4-python-libraries-to-detect-english-and-non-english-language-c82ad3efd430 I vaguely remember seeing someone has published an add-on for this on nvda-addons mailing list a while ago, but not sure if it's still around. I am currently too busy to work on this, but early next year I will have a few months off from work, so if nobody implements this feature by then - I will consider implementing this myself - if NVDA devs don't mind.

ruifontes commented 1 year ago

Hello!

This is a big message...

In a conversation with mohammad suliman mohmad.s93@gmail.com:

mohammad suliman wrote: We are delighted to announce that we are working on reintroducing the magnificent work done by Dinesh Kaushal in pull request #7629. The PR was closed due to lack of activity, and we wish to introduce a new one with improvements, while also taking into account the changes Reef requested on the previous PR.

Good! I can cooperate in several tasks, but not coding, since my skills are far away from yours!

You wrote:

First, we want to highlight that this PR is very needed for us multilingual users. Last release, the add-on most of our community relied on stopped working with regards to auto language switching, so some of us opted to not update to the new version until the add-on is fixed, and unfortunately some migrated to use other screen readers due to this kind of issues. What we are trying to convey is that the feature is very helpful for us multilingual users, so hopefully NV Access will triage it accordingly.

Yes, I know that and we try to make our Vocalizer NVDA compatible as soon as possible!

You wrote:

That means that if NVDA encounters a specific language, let's say Hebrew for the sake of the discussion, then it will continue to speak using this language, including letters, symbols and punctuation, numbers, and emoji.

- We think that this behavior is the preferred one for most users, but we are not sure whether we need to make this behavior configurable using checkboxes in the interface, which would enable the user to choose whether symbols, numbers, emojis, and so on need to be spoken using the surrounding text language or the default one

We have chosen the second way, making it configurable, since many users preferred to use Hebrew numbers and symbols even if the text is in English...

You wrote: Regarding the interface, we propose the following:

- A new panel will be created for the language detection feature, and it will be added to the category list in Preferences of course
- As before, the panel will include the following components:
  - A list of the user's preferred languages, where the order of the languages in the list is the order in which the mechanism will prioritize languages
  - Buttons for moving languages up and down in the list
  - Buttons for adding and removing languages from the list

As we have in Vocalizer, I will suggest a combobox to select the voice to use...

You wrote:

- We propose the following components to be added also:
  - A combobox for auto language switching with the following 3 options:
    - Off (auto language switching is disabled)
    - On (switch languages according to document language properties)
    - Advanced (switch languages using Unicode character properties as well as document language properties)

I disagree, and suggest 4 options, including:

switch languages using Unicode character properties only

This is because we have found many web pages marked as English when they are really in Portuguese, Spanish and so on...
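
To make the four options easier to compare, here is a small sketch of how a setting like this might be modelled. The enum `SwitchMode`, the function `choose_language` and the precedence used in the Advanced case (character-based detection first, then the document markup) are assumptions for illustration; the thread does not specify them.

```python
from enum import Enum


class SwitchMode(Enum):
    OFF = "off"                   # no automatic switching
    DOCUMENT = "document"         # follow document language properties only
    UNICODE_ONLY = "unicodeOnly"  # follow Unicode character properties only
    ADVANCED = "advanced"         # use both sources


def choose_language(mode, default_lang, document_lang=None, detected_lang=None):
    """Pick the language to speak with, given what each source reports."""
    if mode is SwitchMode.OFF:
        return default_lang
    if mode is SwitchMode.DOCUMENT:
        return document_lang or default_lang
    if mode is SwitchMode.UNICODE_ONLY:
        return detected_lang or default_lang
    # ADVANCED (assumed precedence): trust the characters over the markup.
    return detected_lang or document_lang or default_lang
```

For a page wrongly marked as English whose text is detected as Portuguese, the Unicode-only mode would return `pt` while the document mode would return `en`.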

You wrote: We want to highlight also that we credit most of this work to Dinesh Kaushal, who has done great and hard work on this task, and it hurts that the work hasn't been included in NVDA yet. The ideal scenario would be that Dinesh completes this work, but as said before, the PR was closed due to lack of activity, and we need this feature so much, so we decided to complete Dinesh's way.

If you want also to get some coding logic or GUI from Vocalizer Expressive, feel free to do it!

And, finally, one suggestion:

Why not use, after the language selection through the character set, one feature to try to get the correct language through a package named langDetect?

I have tried several similar tools, and this one proved to be the fastest and reliable to use...

With more than 4 words the results are almost perfect...

And it is easy to use in NVDA. It can get only the most probable language or a set of, I think, 3 most probable languages...
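
A two-stage scheme of this kind could be sketched as follows. This is only an illustration: `refine_language` is a hypothetical function, the statistical detector is passed in as a plain callable so that any package could be wrapped, and the five-word threshold is taken from the "more than 4 words" remark above.

```python
def refine_language(text, script_lang, candidates, statistical_detector, min_words=5):
    """Refine a script-based guess with a statistical detector on longer text.

    script_lang: language already chosen from the character set.
    candidates: languages that share that script.
    statistical_detector: callable taking text and returning a language code
        (hypothetical; for example a thin wrapper around a detection package).
    """
    # Short strings (menu names, single lines) keep the script-based choice.
    if len(text.split()) < min_words:
        return script_lang
    guess = statistical_detector(text)
    # Ignore guesses that contradict the script, such as Russian for Latin text.
    return guess if guess in candidates else script_lang
```

Keeping the script-based stage first means the slower statistical stage only runs on text long enough for it to be reliable.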

Here is a small add-on I made to test it:

https://www.dropbox.com/s/3jesk88koae35sg/languageDetect_1.0_Gen.nvda-addon?dl=1

I could not fully understand the speech module, so I could not try to include this in our language switching mechanism...

The commands are:

    "NVDA+Shift+l": "getLang",
    "NVDA+Control+Shift+l": "getLangs",

Sorry for writing in private, but I think it is more productive...

Best regards,

Rui Fontes
Tiflotecnia, Lda and NVDA Portuguese team

nvaccess / nvda

Automatic language detection based on unicode ranges #2990