nvaccess / nvda

NVDA, the free and open source Screen Reader for Microsoft Windows
https://www.nvaccess.org/

Automatic language detection based on unicode ranges #2990

Open nvaccessAuto opened 11 years ago

nvaccessAuto commented 11 years ago

Reported by ragb on 2013-02-13 12:26

This is kind of a spin-off of #279.

As settled some time ago, this proposal aims to implement automatic text “language” detection for NVDA. The main goal of this feature is to let users read text in different languages (or, better said, language families) using the proper synthesizer voices. By using Unicode character ranges, one can determine at least the language family of a piece of text: Latin-based (English, German, Portuguese, Spanish, French,…), Cyrillic (Russian, Ukrainian,…), kanji (Japanese, maybe Korean; I had that written down already, but it is too much for my memory), Greek, Arabic (Arabic, Farsi), and others.

In broad terms, implementing this feature in NVDA requires adding a detection module to the speech subsystem that intercepts speech commands and adds “fake” language commands telling the synth to change language, based on changes in the characters of the text. An interface is also needed for the user to tell NVDA which particular language to choose for each language family, that is, what to assume for Latin-based characters, what to assume for Arabic-based characters, etc.

I’ve implemented a prototype of this feature in a custom Vocalizer driver, with no interface to choose the “proper” language. Preliminary testing with Arabic users, using Arabic and English Vocalizer voices, has shown good results, that is, people like the idea. The language detection code was adapted from the Guess_language module, removing some of the detection code which was not applicable (trigram detection for differentiating Latin languages, for instance).

I’ll explain the decision to use, for now, only Unicode-based language detection. Language detection could also be done using trigrams (see here for instance), dictionaries, or other heuristics of that kind. However, the text that is passed to the synthesizer each time is very small (a line of text, a menu name, etc.), which makes these processes, which are probabilistic by nature, very error-prone. In my testing, applying trigram detection for Latin languages in NVDA proved completely unusable, besides adding a noticeable delay when speaking. For bigger text content (books, articles, etc.) it seems to work well; however, I don’t know if this can be applied somehow in the future, say by analyzing virtual buffers or something similar.

Regarding punctuation, digits, and other general characters, I’m defaulting to the current language (and voice) of the synth.
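The approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the prototype's code: `LangChange` is a hypothetical stand-in for a synthesizer language command, scripts are guessed from the first word of the Unicode character name rather than from range tables, and the family-to-language mapping is the user choice the proposal asks an interface for. Digits, punctuation and spaces fall back to the synth's current language, as stated above.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class LangChange:
    """Hypothetical stand-in for a 'switch language' speech command."""
    lang: str


# Assumed subset of script families, matched against Unicode character names.
SCRIPT_PREFIXES = ("LATIN", "CYRILLIC", "GREEK", "ARABIC")


def script_of(ch):
    """Return the script family of a letter, or None for neutral characters."""
    if not ch.isalpha():
        return None  # digits, punctuation, whitespace
    name = unicodedata.name(ch, "")
    for prefix in SCRIPT_PREFIXES:
        if name.startswith(prefix):
            return prefix.capitalize()
    return None


def add_language_commands(text, family_to_lang, default_lang):
    """Split text into runs and put a LangChange before each run."""
    sequence = []
    current_lang = None
    for ch in text:
        # Neutral or unknown characters use the synth's current language.
        lang = family_to_lang.get(script_of(ch), default_lang)
        if lang != current_lang:
            sequence.append(LangChange(lang))
            sequence.append(ch)
            current_lang = lang
        else:
            sequence[-1] += ch  # extend the current text run
    return sequence


mapping = {"Latin": "en", "Cyrillic": "ru"}
result = add_language_commands("ok привет", mapping, "en")
```

Here `result` is a list alternating language commands and text runs. Because neutral characters go to the default language, a comma inside Cyrillic text forces two extra language switches, which is one cost of this simple rule.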

I’ll create a branch with my detection module integrated within NVDA, with no interface.

Regarding the interface for selecting what language to assume for each given language group (when applicable, greek, for instance, is only itself), I see a dialog with various combo boxes, each one for each language family, to choose the language to be used. I think restricting the available language choices from the available languages of the current synth may improve usability. I don’t know where to put that dialog, or what to call it (“language detection options”?).

Any questions please ask.

Regards,

Rui Batista

Blocked by #5427, #5438

nvaccessAuto commented 11 years ago

Comment 1 by jteh on 2013-02-13 12:29 Is this technically a duplicate of #1606? (If so, we'd probably close #1606, since this one contains more technical detail.)

nvaccessAuto commented 11 years ago

Comment 2 by ragb (in reply to comment 1) on 2013-02-13 12:37 Replying to jteh:

Is this technically a duplicate of #1606? (If so, we'd probably close #1606, since this one contains more technical detail.)

I think #1606 is only related to punctuation, although, to be honest, I don't understand that ticket's description that well.

nvaccessAuto commented 11 years ago

Comment 3 by Ahiiron on 2013-05-21 14:35 I think for usability and reliability as you said, the user would probably configure languages to auto-switch to, like the Vocalizer implementation.

nvaccessAuto commented 9 years ago

Comment 4 by dineshkaushal on 2015-07-13 05:30 Please check auto language detection.

There is a Writing Script dialog within preferences menu. This dialog has options to add/ remove and move up and down languages. I tested with 2 Devanagari languages Hindi and marathi, and I could get the proper language code for those languages in the log.

Code is in branch in_t2990

nvaccessAuto commented 9 years ago

Comment 6 by dineshkaushal on 2015-08-17 19:16 In this round, the adjacent ranges are merged, the code is reorganized, an option to ignore language detection for text whose language is specified by the document is added, a detailed review of the sequence is done, and comments are improved. There are 2 branches: in_t2990, which uses ISO 15924 script codes and has somewhat more complicated but presumably faster code, and in_t2990_simple, which removes the ISO codes and has simpler, hopefully not slower, code.

nvaccessAuto commented 9 years ago

Comment 7 by jteh on 2015-09-21 05:09 Note: there was a round of code review which was unfortunately lost due to the recent server failure. However, the review was addressed. The following relates to the most recent changes.

Thanks for the changes, Dinesh. This looks pretty good. A few things:

gui

unicodeScriptHandler

unicodeScriptPrep

Documentation

Thanks!

nvaccessAuto commented 9 years ago

Comment 8 by dineshkaushal on 2015-09-28 08:11 Fixed all the code related issues. I have not yet added the documentation; I will do it once the code is ok. Should I modify userGuide.html?

nvaccessAuto commented 9 years ago

Comment 9 by dineshkaushal on 2015-10-07 13:34 Added documentation for Writing Scripts section in configuring NVDA main section.

nvaccessAuto commented 9 years ago

Comment 10 by James Teh <jamie@... on 2015-10-18 23:55 In commit eb09127eae149c2e47a862af8e403bf78b594896: Merge branch 't2990' into next

Incubates #2990. Changes: Added labels: incubating

nvaccessAuto commented 9 years ago

Comment 11 by jteh on 2015-10-19 01:22 Thanks. I made quite a few changes before incubating. Here are the significant ones:

nvaccessAuto commented 9 years ago

Comment 12 by MarcoZehe on 2015-10-19 10:46 This has some unwanted side effects: The latin unicode range seems to be hard-coded to English, but the range may also include French, German, and other European languages. In my case, I am bilingually working in English and German contexts all day. So even when my Windows is set to English, my synthesizer is usually set to the German voice, because I can stand the German voice speaking English, but I cannot stand the English voice, of any synthesizer, try to speak German.

In consequence: If I try to set my synth to German Anna in the Vocalizer 2.0 for NVDA, it will still use the English Samantha voice for most things, even German web pages. I have to turn off language detection completely to get my old functionality back. This will, of course, also take away the language switching where the author did use correct lang attributes on web sites or in Word documents.

nvaccessAuto commented 9 years ago

Comment 14 by James Teh <jamie@... on 2015-10-19 11:59 In commit 6fd9ad34fc7a422418b21abbdc48034ac3687f9b: Merge branch 't2990' into next: Hopefully fixed problems which caused the voice language not to be preferred for language detection.

Incubates #2990.

nvaccessAuto commented 9 years ago

Comment 16 by nishimotz on 2015-10-19 12:32 I have tested nvda_snapshot_next-12613,8dbd961 with an add-on version of Japanese TTS, which is developed by me and supports LangChangeCommand.

For example, the word 'Yomu' ('read' in Japanese) usually consists of two characters, 0x8aad and 0x3080.

読む

The first one is an ideographic character (a Chinese character), and the second is a phonetic character (Hiragana).

To give the correct pronunciation, a Japanese TTS should take the two characters at the same time, because the reading of a Chinese character is context-dependent in the Japanese language.

With this version of NVDA, the two letters are pronounced separately, so the reading of first letter is wrong. If automatic language detection is turned off, the issue does not occur.

In unicodeScriptData.py, it seems that 0x8aad is in the range of "Han", and 0x3080 is "Hiragana". For the Japanese language, they should be treated as a single item in the detectedLanguageSequence.
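The split can be seen with Python's standard `unicodedata` module. This is a small illustration only; it derives a rough script hint from the character name (the `script_hint` helper is hypothetical) and does not use the ranges in unicodeScriptData.py.

```python
import unicodedata


def script_hint(ch):
    """Rough script hint taken from the Unicode character name."""
    name = unicodedata.name(ch)
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"
    return name.split()[0].capitalize()


word = "\u8aad\u3080"  # 読む
hints = [script_hint(ch) for ch in word]
```

For this word `hints` holds two different scripts, so any algorithm that starts a new item at every script change hands the two characters to the synthesizer separately.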

nvaccessAuto commented 9 years ago

Comment 18 by jteh (in reply to comment 16) on 2015-10-26 11:04 Dinesh, thoughts on comment:16?

nvaccessAuto commented 9 years ago

Comment 19 by nvdakor on 2015-10-27 07:51 Hi, To whoever coded lang detection dialog: may I suggest some GUI changes:

nvaccessAuto commented 9 years ago

Comment 20 by nvdakor on 2015-10-27 07:53 Hi, On second thoughts, I'd wait until the fundamentals are done (including fixing comment 16) before pushing GUI changes.

nvaccessAuto commented 9 years ago

Comment 21 by mohammed on 2015-10-27 13:47 hi.

another GUI change would be to only have a close button. I don't think OK and cancel are functional in this dialogue box. thoughts?

on another note, since #5427 is closed as fixed, I think it should be removed from the blocking tickets?

thanks.

nvaccessAuto commented 9 years ago

Comment 22 by jteh on 2015-10-28 00:41 Holding this back for 2015.4, as there are outstanding issues, and even if they are fixed, there won't be sufficient time for them to be tested. Changes: Milestone changed from near-term to None

nvaccessAuto commented 9 years ago

Comment 23 by jteh (in reply to comment 21) on 2015-10-28 00:51 Replying to mohammed:

another GUI change would be to only have a close button. I don't think OK and cancel are functional in this dialogue box.

They should be. Cancel should discard any changes you make (e.g. removing a language you didn't intend to remove), whereas OK saves them.

on another note, since #5427 is closed as fixed, I think it should be removed from the blocking tickets?

No, it shouldn't. Blocking indicates whether another ticket was required for this one, whether it's fixed yet or not. If it is fixed, it's still useful to know that it was required.

nvaccessAuto commented 9 years ago

Comment 24 by dineshkaushal on 2015-10-28 07:32 Regarding comment 16:

The problem of Han and Hiragana occurs because our algorithm assumes that each language has only one script. One possible solution is that, during unicodeData building, we could name all Han and Hiragana characters something like HiraganaHan and then add a language-to-script mapping for Japanese as HiraganaHan. We could do the same for Chinese and Korean.

Another solution is to create script groups, check each character against them, and not split strings within a script group.

Could anyone explain which scripts are relevant for the Japanese, Chinese and Korean languages, and how the various scripts combine for these languages?

Alternatively, could anyone point to a reliable reference?
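The second idea, script groups, might look something like the sketch below. The group table and the `merge_runs` helper are assumptions made for illustration. Japanese text does combine Han (kanji), Hiragana and Katakana, but Han is also used for Chinese, so a real table would need the user's language preference to decide which group applies.

```python
# Hypothetical script-group table: scripts that must stay together.
SCRIPT_GROUPS = {
    "Han": "Japanese",
    "Hiragana": "Japanese",
    "Katakana": "Japanese",
}


def merge_runs(runs):
    """Merge adjacent (script, text) runs that belong to the same group."""
    merged = []
    for script, text in runs:
        group = SCRIPT_GROUPS.get(script, script)  # ungrouped scripts map to themselves
        if merged and merged[-1][0] == group:
            merged[-1] = (group, merged[-1][1] + text)
        else:
            merged.append((group, text))
    return merged


runs = [("Han", "読"), ("Hiragana", "む"), ("Latin", " ok")]
merged = merge_runs(runs)
```

With this table the Han and Hiragana runs collapse into one item, so the word reaches the synthesizer intact while the Latin run stays separate.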

nvaccessAuto commented 9 years ago

Comment 26 by nishimotz on 2015-10-28 08:49 To state the conclusion first, the approach of the DualVoice add-on is much more useful for Japanese users:

I think such requirements arise because the Japanese TTS and symbol dictionary already cover wider ranges of Unicode characters, for historical reasons.

If such requirement is only for Japanese users, I will work around only for Japanese. However, I would like to hear from other language users who have similar requirements.

nvaccessAuto commented 9 years ago

Comment 27 by jteh on 2015-10-29 00:35 Note that switching to specific voices and synthesisers for specific languages is not meant to be covered here. We'll handle that separately, as among other things, it depends on speech refactor (#4877).

nvaccessAuto commented 9 years ago

Comment 28 by nishimotz on 2015-10-29 03:01 In Japan, there are some users of Vocalizer for NVDA.

https://vocalizer-nvda.com/docs/en/userguide.html#automatic-language-switching-settings

I am asking them about their usage of this functionality.

As far as I heard, automatic language switching based on content attribute and character code should be separately disabled for Japanese language users.

nvaccessAuto commented 9 years ago

Comment 29 by jteh (in reply to comment 28) on 2015-10-29 03:09 Replying to nishimotz:

In Japan, there are some users of Vocalizer for NVDA.

As far as I heard, automatic language switching based on content attribute and character code should be separately disabled for Japanese language users.

To clarify, do you mean that these users disable language detection (using characters), but leave language switching for author-specified language enabled? Or are you saying the reverse? Or are you saying that different users have different settings, but all agree both need to be toggled separately? How well does the Vocalizer language detection implementation work for Japanese users?

For what it's worth, I'm starting to think we should allow users to disable language detection (i.e. using characters) separately. At the very least, it provides for a workaround if our language detection code gets it wrong. I'm not convinced it is necessary to separately disable author-specified language switching, though. If you disagree, can you explain why?

nvaccessAuto commented 9 years ago

Comment 30 by nishimotz on 2015-10-29 03:51 Author-specified language switching is useful for users of multilingual synthesizers, however it should be disabled in some cases.

For example, if a synthesizer supports English and Japanese, and if the actual content of a web site is written in Japanese characters, and the element is incorrectly attributed as lang='en', the content cannot be accessed at all, without turning off the author-specified language switching. Such websites have been reported by the NVDA users in Japan.

I am now investigating the implementation of Vocalizer language detection by myself; however, I heard that it is only useful for working with multilingual materials.

nvaccessAuto commented 9 years ago

Comment 31 by nishimotz on 2015-10-29 12:41 As far as I have investigated, Vocalizer driver 3.0.12 covers various needs of Japanese NVDA users.

The important feature is: "Ignore numbers and common punctuation when detecting text language." Without this, automatic language detection based on characters is difficult to use with Japanese TTS.

By the way, it would be nice to allow disabling "language switching for author-specified language" and enabling "detect text language based on unicode characters" in some cases. Vocalizer for NVDA does not allow this so far.

For example, Microsoft Word already has the ability to detect content language based on character code. For choosing visual appearance such as display font, this works very well. However, speech would be very difficult to understand if NVDA voice languages were switched by such language attributes, because a Japanese sentence usually contains half-width numbers or symbols alongside full-width Japanese characters. To be pronounced correctly, they should be sent to the Japanese TTS together.

I am now asking some friends about this, but it seems Japanese users of Microsoft Word cannot use the language switching of NVDA because of this.
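One way to picture "ignore numbers and common punctuation" is as a pass that folds neutral runs into their neighbours. The sketch below is an assumption about how such an option could behave, not the Vocalizer driver's logic; `absorb_neutral` and the run format are hypothetical, with `None` marking a run of digits or punctuation.

```python
def absorb_neutral(runs):
    """Fold neutral runs (language None) into the neighbouring language run."""
    out = []
    for lang, text in runs:
        # A neutral run, or a run in the same language, extends the previous run.
        if out and (lang is None or lang == out[-1][0]):
            out[-1] = (out[-1][0], out[-1][1] + text)
        else:
            out.append((lang, text))
    # A leading neutral run takes the language of the run that follows it.
    if len(out) > 1 and out[0][0] is None:
        out[1] = (out[1][0], out[0][1] + out[1][1])
        del out[0]
    return out


runs = [(None, "("), ("ja", "価格"), (None, "100"), ("ja", "円"), ("en", "yen")]
joined = absorb_neutral(runs)
```

Here the bracket, the half-width digits and the Japanese characters end up in a single Japanese run, so they would be sent to the Japanese TTS together.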

nvaccessAuto commented 9 years ago

Comment 32 by James Teh <jamie@... on 2015-11-02 05:30 In commit 2bba21c53cd925c36a836041d10c859b551cd506: Revert "NVDA now attempts to automatically detect the language of text to enable automatic language switching even if the author has not specified the language of the text. See the Language Detection section of the User Guide for details."

This is causing problems for quite a few languages and needs some additional work before it is ready. This reverts commits 60c25e83 and 72f85147. Re #2990.

nvaccessAuto commented 9 years ago

Comment 33 by jteh on 2015-11-02 05:31 Changes: Removed labels: incubating

nvaccessAuto commented 9 years ago

Comment 34 by mohammed on 2015-11-04 16:00 hi.

it'd be good if people here can try the automatic language implementation in the new add-on from Codefactory. for me it works if I choose an English voice from NVDA's voice settings dialog box. the only annoyance for me is that I hear punctuation marks with the Arabic voice regardless of the state of "Trust voice's language when processing characters and symbols".

Jamie, can we probably make this functionality that has been reverted available as an add-on? because for me, it is the most successful implementation where my primary language is English and Arabic is secondary. it worked perfectly for me.

nvaccessAuto commented 9 years ago

Comment 35 by jteh (in reply to comment 34) on 2015-11-04 22:24 Replying to mohammed:

it'd be good if people here can try the automatic language implementation in the new add-on from Codefactory.

Do you mean that the Code Factory add-on includes its own language detection, or do you mean you were trying an NVDA next build which included this functionality (before it was reverted)? I assume the second, but just checking.

Jamie, can we probably make this functionality that has been reverted available as an add-on?

Unfortunately, no; it needs to integrate quite deeply into NVDA's speech code. However, work on this isn't being abandoned. It just needs more work before it's ready for widespread testing again.

nvaccessAuto commented 9 years ago

Comment 36 by mohammed (in reply to comment 35) on 2015-11-04 23:02 Replying to jteh:

The new CodeFactory add-on has its own implementation of language detection: From NVDA's menu go to codefactory / Vocalizer, in the settings tab it has the following check box: "Language Detection".

Note that this implementation isn't open source; it's part of the Code Factory proprietary synthesiser. I'm still not clear as to whether you were happy with the internal implementation in NVDA or whether you preferred the Code Factory implementation.

mohdshara commented 7 years ago

can this be looked into again and be given a priority?

feerrenrut commented 7 years ago

Given that some work has already gone into this, hopefully this isn't too far away. Based on this I'll set it to priority 2. @jcsteh Could you please comment to summarise the work remaining here?

mohdshara commented 7 years ago

hi. Now that #7110 and #6159 are incubating I would like to inquire about this. is the planned speech refactor going to be beneficial towards automatic language switching? is this blocked by that work somehow? if not, can someone explain what is exactly needed for this to incubate?

Thanks for the wonderful work.


jcsteh commented 7 years ago

Detection of language based on Unicode characters is separate from speech refactor. However, part of speech refactor would allow switching to a different synthesiser for specific languages; see #279. While related from a user perspective, these features should be considered separately.

As to what is blocking this, there are quite a few outstanding issues:

  1. Some languages contain characters from multiple scripts. The current algorithm does not handle this correctly. Raised in https://github.com/nvaccess/nvda/issues/2990#issuecomment-155304784, possible solutions discussed in https://github.com/nvaccess/nvda/issues/2990#issuecomment-155304792.
  2. Option to ignore common numbers and punctuation when detecting text language. (https://github.com/nvaccess/nvda/issues/2990#issuecomment-155304798)
  3. Ability to separately disable language switching based on author specified language (while leaving language detection based on Unicode characters enabled). (https://github.com/nvaccess/nvda/issues/2990#issuecomment-155304797, second part of https://github.com/nvaccess/nvda/issues/2990#issuecomment-155304798)
  4. Ability to disable language detection based on Unicode characters without disabling language switching based on author specified language. This would provide a workaround for cases where text detection gets it wrong, which it seems is inevitable for at least some use cases.
  5. Minor GUI issues (https://github.com/nvaccess/nvda/issues/2990#issuecomment-155304787).
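Points 3 and 4 amount to two independent switches. A minimal sketch of how they could combine is shown below; the function, its parameter names and the precedence (author-specified language first, then detection, then the voice's language) are all assumptions for illustration, not NVDA's settings or behaviour.

```python
def choose_language(author_lang, detected_lang, voice_lang,
                    use_author_lang=True, use_detection=True):
    """Pick the language for a text run from two independent toggles."""
    # Assumed precedence: an author-specified language wins when enabled.
    if use_author_lang and author_lang is not None:
        return author_lang
    # Otherwise fall back to Unicode-based detection when enabled.
    if use_detection and detected_lang is not None:
        return detected_lang
    # With both sources off or unavailable, keep the voice's own language.
    return voice_lang


# Japanese text wrongly marked up as lang='en', read with a Japanese voice:
both_on = choose_language("en", "ja", "ja")
author_off = choose_language("en", "ja", "ja", use_author_lang=False)
detection_off = choose_language(None, "en", "de", use_detection=False)
```

With separate toggles, a user hit by wrong markup can switch off only the author-specified source, and a user hit by wrong detection can switch off only detection.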

Points 1 and 2 are going to be tricky. There's also an open question as to whether to try to adapt the language detection implementation in the Tiflotecnia Vocalizer driver (which appears to work well for some users), rather than further working on the implementation provided here.

@dineshkaushal, if I recall correctly, this is no longer something your team wants to pursue. Is that still correct?

LeonarddeR commented 7 years ago

I might be missing something, but I think it might help if there is an official pr for the code which has been reviewed earlier.

mohdshara commented 7 years ago

@jcsteh I think @leonardder's request is valid if we want to seek help from potential developers on this one?

jcsteh commented 7 years ago

@dineshkaushal is currently working on this. @dineshkaushal, it'd be good if you can provide status updates here. Thanks.

mohdshara commented 7 years ago

@dineshkaushal, Is there any help at all I can offer regarding this work?

dineshkaushal commented 7 years ago

Dear All,

After a long hibernation, I am back to work on this issue, and I am determined to close it asap.

I have made some fixes for detecting the Japanese language. @nishimotz, could you test it and let me know whether this build fixes the issues?

If this logic fixes the bug, then I can add other languages that use multiple scripts such as Chinese.

Note: this is a build made on my system so it is not signed.

https://www.dropbox.com/s/xlylzf0outcjom6/nvda_snapshot_source-in_t2990_new-2dd9048.exe?dl=0

@mohdshara thank you for your offer for the support. Could you also test this and let me know your inputs?


bhavyashah commented 7 years ago

@mohdshara @nishimotz Could you please respond to https://github.com/nvaccess/nvda/issues/2990#issuecomment-319133078 as requested by @dineshkaushal by testing the build provided and sharing your feedback?

nishimotz commented 7 years ago

I will check it on the weekend. where can I see the branch corresponding to the nvda_snapshot_source-in_t2990_new-2dd9048.exe?

nishimotz commented 7 years ago

@dineshkaushal I have tested the binary build. When I add Japanese to the preferred language list in the language detection settings, an error occurs as follows:

ERROR - queueHandler.flushQueue (18:57:12):
Error in func message from eventQueue
Traceback (most recent call last):
  File "queueHandler.pyc", line 50, in flushQueue
  File "ui.pyc", line 66, in message
  File "speech.pyc", line 124, in speakMessage
  File "speech.pyc", line 402, in speakText
  File "speech.pyc", line 508, in speak
  File "languageDetection.pyc", line 239, in detectLanguage
  File "languageDetection.pyc", line 130, in getLangID
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

I cannot understand how this language list will be used. Please let me review the code.
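The final line of the traceback is the message Python produces when non-ASCII text is encoded with the ascii codec. The snippet below reproduces the same message in Python 3 with a three-character Japanese string, which matches "position 0-2". The string is only a guess, since the actual value and the code at line 130 of languageDetection are not shown in this thread.

```python
lang_name = "日本語"  # assumed example value: three non-ASCII characters

try:
    lang_name.encode("ascii")  # explicit here; the real trigger is not shown
    message = None
except UnicodeEncodeError as error:
    message = str(error)
    span = (error.start, error.end)  # half-open range of offending characters
```

Keeping such values as Unicode text throughout, or encoding them explicitly as UTF-8 where bytes are really needed, avoids this class of error.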

mohdshara commented 7 years ago

Hi. For me, if I install this build it gives me an error as soon as it's started. unfortunately, I only hear the error sound and can't tell what is going on. the portable copy works, however, I think there's something missing. I can't use Windows One core voices and I remember we were able to choose which synth / voice is used for what language, isn't this true?

Thank you very much for coming back to this, and sorry for delay as I was on my annual vacation.

dineshkaushal commented 7 years ago

Thanks @nishimotz and @mohdshara for giving it a try.

The priority language list is used to select a language when several languages use the same script. For example, English and German both use the Latin script, so the user could add German to give it priority whenever Latin text is found.

Similarly, Japanese and Chinese could use the same script, but I am not sure which scripts are used by Chinese. So for now I have added the same scripts for both.
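The priority list described above can be sketched as a lookup that returns the first preferred language able to cover a detected script. The table and function names here are hypothetical, and the script sets are assumptions for illustration (the branch currently assigns Japanese and Chinese the same scripts, which this table does not).

```python
# Hypothetical language-to-script table, for illustration only.
LANG_SCRIPTS = {
    "en": {"Latin"},
    "de": {"Latin"},
    "ja": {"Han", "Hiragana", "Katakana"},
    "zh": {"Han"},
}


def language_for_script(script, preferred, default_lang):
    """Return the first preferred language that uses the given script."""
    for lang in preferred:
        if script in LANG_SCRIPTS.get(lang, ()):
            return lang
    return default_lang  # no preferred language covers this script


latin_choice = language_for_script("Latin", ["de", "ja"], "en")
han_choice = language_for_script("Han", ["zh", "ja"], "en")
fallback = language_for_script("Cyrillic", ["de"], "en")
```

The order of the list matters: moving a language up changes which one wins for a shared script, which is what the move up and move down options in the dialog are for.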

I will provide the branch for you to make your own build.


mohdshara commented 7 years ago

@dineshkaushal, on what build of nvda is this based? why isn't "Windows one core" voices an option? Also, it seems the implementation assumes that one synth supports the wanted language, it can't use voices across multiple synthesizers?

jcsteh commented 7 years ago

Supporting switching to voices across multiple synthesisers is a very separate (but related) issue which needs to be handled elsewhere. That's covered by #279. To test this, you'll need to be working with a single synthesizer which supports multiple languages.

mohdshara commented 7 years ago

@jcsteh thanks for the info, very useful. I still need to know why I can't use windows one core to test the try build though.

dineshkaushal commented 7 years ago

Please try the branch at

https://github.com/nvda-india/nvda/tree/in-t2990-review

This branch is based on latest master so there should not be any error.

nishimotz commented 7 years ago

@dineshkaushal please review my pull request on your repository regarding encoding issues.

I am still investigating the language detection itself.