nvaccess / nvda

NVDA, the free and open source Screen Reader for Microsoft Windows
https://www.nvaccess.org/
Other
2.1k stars 634 forks source link

Compiling Liblouis with 32-bit Unicode support #6695

Closed AAClause closed 5 years ago

AAClause commented 7 years ago

In my opinion, the Liblouis DLL included in NVDA should be compiled with 32-bit Unicode. Currently, we improve the French table and we'd like to add any 32-bit Unicode such as \y1D4D0, \y1D49C, \y1D4D2, \y1D49E, etc. What do you think?

josephsl commented 7 years ago

Hi, have you contacted Liblouis folks about this possibility? Thanks.

From: André-Abush Clause [mailto:notifications@github.com] Sent: Wednesday, January 4, 2017 8:29 AM To: nvaccess/nvda nvda@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [nvaccess/nvda] Compiling Liblouis with 32-bit Unicode support (#6695)

In my opinion, the Liblouis DLL included in NVDA should be compiled with 32-bit Unicode. Currently, we improve the French table and we'd like to add any 32-bit Unicode such as \y1D4D0, \y1D49C, \y1D4D2, \y1D49E, etc. What do you think?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/nvaccess/nvda/issues/6695 , or mute the thread https://github.com/notifications/unsubscribe-auth/AHgLkACqe_CV2gtBzv8GH_1rJZgTcW5Wks5rO8jZgaJpZM4Lazbp .

AAClause commented 7 years ago

@josephsl yes, it is possible. See https://github.com/liblouis/liblouis/blob/master/README.windows

Edit the file configure.mk If you want 32-bit unicode change the 2 in the line UCS=2 to a 4.

josephsl commented 7 years ago

CC @JcSTeh, @DKager, @LeonardDer

From: André-Abush Clause [mailto:notifications@github.com] Sent: Wednesday, January 4, 2017 8:43 AM To: nvaccess/nvda nvda@noreply.github.com Cc: Joseph Lee joseph.lee22590@gmail.com; Mention mention@noreply.github.com Subject: Re: [nvaccess/nvda] Compiling Liblouis with 32-bit Unicode support (#6695)

@josephsl https://github.com/josephsl yes, it is possible. See https://github.com/liblouis/liblouis/blob/master/README.windows

Edit the file configure.mk If you want 32-bit unicode change the 2 in the line UCS=2 to a 4.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nvaccess/nvda/issues/6695#issuecomment-270419286 , or mute the thread https://github.com/notifications/unsubscribe-auth/AHgLkN28Hz7C6gLCwbHTqmhFR70RvS61ks5rO8wQgaJpZM4Lazbp .

dkager commented 7 years ago

AFAIK that needs an update to the Python wrapper similar to what is done in Java. Another use case for this feature is support for emoji, e.g. U+1F170.

jcsteh commented 7 years ago
  1. We don't build using the liblouis build stuff for Windows; it's integrated into NVDA's own build system. However, it's easy enough to switch that to 32 bit Unicode.
  2. As @DKager notes, the bigger issue is that the Python bindings are currently tied to the OS Python Unicode width, which is 16 bits on Windows. We'd need to somehow make this optional in the upstream Python bindings based on how liblouis was compiled. That's going to be a bit tricky because on Windows, Python Unicode still uses 16 bits, so the bindings will have to figure out when and what to convert. We also need to make sure it doesn't break Unix. :)
jcsteh commented 7 years ago

@Andre9642, how important is this for French? Does the table fail to work with NVDA if you include these characters or do they just get ignored? Are these characters frequently used in French?

AAClause commented 7 years ago

@jcsteh:

Yes, the table fail to work with NVDA if I include these characters. Moreover Lou_checktable indicate that the table is invalid:

\<table>:\<line>: error: liblouis has not been compiled for 32-bit Unicode

No, these characters aren't frequently used in French. They are used in mathematics. In fact, our additions relate in part to Unicode Math Symbols. Extract:

uplow \x0391\x03b1 46-45-1,45-1 Α,α alpha uplow \x0392\x03b2 46-45-12,45-12 Β,β beta uplow \x0393\x03b3 46-45-1245,45-1245 Γ,γ gamma ... #sign \y1D4D2 46-5-14 # MATHEMATICAL BOLD SCRIPT CAPITAL C ... #sign \y1D49F 46-5-145 # MATHEMATICAL SCRIPT CAPITAL D ...

jcsteh commented 7 years ago

P2 because this prevents tables from even working in NVDA if they contain 32 bit characters.

bhavyashah commented 7 years ago

Just a gentle ping to bring the attention of NVDA code contributors to this P2 issue in case anyone desires to work on this.

dkager commented 7 years ago

As noted in https://github.com/nvaccess/nvda/issues/6695#issuecomment-270511629 this needs some upstream work first. (In general, I'm not too fond of how liblouis deals with Unicode.)

josephsl commented 7 years ago

Hi, we need this for Python 3 (#7105(. CC @JCSTeh

From: bhavyashah [mailto:notifications@github.com] Sent: Sunday, August 6, 2017 4:52 AM To: nvaccess/nvda nvda@noreply.github.com Cc: Joseph Lee joseph.lee22590@gmail.com; Mention mention@noreply.github.com Subject: Re: [nvaccess/nvda] Compiling Liblouis with 32-bit Unicode support (#6695)

Just a gentle ping to bring the attention of NVDA code contributors to this P2 issue in case anyone desires to work on this.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nvaccess/nvda/issues/6695#issuecomment-320502162 , or mute the thread https://github.com/notifications/unsubscribe-auth/AHgLkDQJgjP_feXsQSQw-ySnBWT7NP21ks5sVajNgaJpZM4Lazbp .

jcsteh commented 7 years ago

@josephsl commented on 7 Aug 2017, 03:31 GMT+10:

Hi, we need this for Python 3 (#7105(.

No, we don't. Python 3 Unicode does handle 32 bit characters (actually, internally, i believe it's variable width now). However, Windows is still UTF-16 and thus ctypes still treats Unicode in C functions as UTF-16, since we're running on Windows. So, this is not required for Python 3.

jcsteh commented 7 years ago

For anyone that does want to try to implement this into the liblouis Python bindings, here are some rough thoughts:

  1. The Python bindings need to know the number of bits used for widechar (16 or 32) by the compiled liblouis. That'll need to be an autotools macro expansion, just as is done for ###LIBLOUIS_SONAME###. (NVDA's build system will also need to handle this expansion as it does for ###LIBLOUIS_SONAME###.)
  2. This determines the encoding to use when passing strings to and from liblouis. If it's 16, we must use UTF-16. If it's 32, we must use UTF-32.
  3. It's probably simplest to tell ctypes to use c_char_p instead of c_wchar_p so that we get raw bytes.
  4. Wherever ctypes.sizeof(ctypes.c_wchar) is used, instead use the number of bits from 1) divided by 8 for bytes.
  5. When passing unicode strings to liblouis, first encode them using the encoding from 2); e.g. inbuf.encode("UTF-32")
  6. After obtaining unicode strings from liblouis, decode them using the encoding from 2) before returning them; e.g. outbuf.value.decode("UTF-32")

Note that this proposed implementation eliminates the need to consider the ctypes unicode width altogether. However, it may be slightly less optimal when the ctypes unicode width does match the liblouis unicode width because it always uses a Python byte string as an intermediary, where perhaps ctypes is more efficient internally; I'm not sure. You could get around this by checking the ctypes unicode width with ctypes.sizeof(ctypes.c_wchar) and only using the proposed new behaviour if it differs from the liblouis width. However, this will make the code somewhat more complicated (especially with regard to testing on different configurations) and is probably premature optimisation unless a significant speed difference can be proved.

dkager commented 7 years ago

There is also a function to get the widechar size, though that would require a library call from the wrapper:

/**
 * Return the size of widechar
 */
LIBLOUIS_API
int EXPORT_CALL lou_charSize(void);

A much more ambitious solution would be to make liblouis work with UTF-16 as opposed to UCS-2, i.e. add support for surrogate characters.

LeonarddeR commented 6 years ago

I'm going to give this a try.

LeonarddeR commented 6 years ago

@jcsteh commented on 7 aug. 2017 01:04 CEST:

  1. The Python bindings need to know the number of bits used for widechar (16 or 32) by the compiled liblouis. That'll need to be an autotools macro expansion, just as is done for ###LIBLOUIS_SONAME###. (NVDA's build system will also need to handle this expansion as it does for ###LIBLOUIS_SONAME###.)

Shouldn't we get rid of autotools macro expansions in python bindings if possible?

jcsteh commented 6 years ago

Ideally, I guess so, but this piece of information is inherently tied to the specific build of liblouis. One way you could work around that is to have some liblouis function which returns this information, but I'm not sure whether that'd be acceptable to the maintainers. Given that we already have LIBLOUIS_SONAME and I really don't see how we could get rid of that, I think it'd be less friction to just go the ugly path and use another macro expansion, but I'm not the one implementing this. :)

LeonarddeR commented 6 years ago

@jcsteh commented on 30 aug. 2018 18:58 CEST:

One way you could work around that is to have some liblouis function which returns this information, but I'm not sure whether that'd be acceptable to the maintainers.

There is such a function as @dkager pointed out in https://github.com/nvaccess/nvda/issues/6695#issuecomment-320570322

@jcsteh commented on 7 aug. 2017 01:04 CEST:

  1. This determines the encoding to use when passing strings to and from liblouis. If it's 16, we must use UTF-16. If it's 32, we must use UTF-32.

For reference, we should explicitly use the little endian versions of utf-16 and utf-32.

  1. After obtaining unicode strings from liblouis, decode them using the encoding from 2) before returning them; e.g. outbuf.value.decode("UTF-32")

Now, that fails, since for c_char_p, the value is truncated as soon as a null character is encountered. Even when we would use c_wchar_p on Windows, the value will be truncated for cases like 'h\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00'

LeonarddeR commented 6 years ago

Just filed https://github.com/liblouis/liblouis/pull/633. I did not yet bother to create a try build due to several issues, including liblouis 3.7 refusing to build when building with NVDA's build system. Feedback is welcome though.

dkager commented 6 years ago

Does their own NMake-based build still work? That seems to break every other release too.

LeonarddeR commented 6 years ago

@jcsteh commented on 7 aug. 2017 00:25 CEST:

@josephsl commented on 7 Aug 2017, 03:31 GMT+10:

Hi, we need this for Python 3 (#7105(.

No, we don't.

Actually, I'm afraid @josephsl has a point.

>>> import louis
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "d:\Development\NVDA\source repo\source\louis\__init__.py", line 358, in <module>
    raise Exception("mismatch in Unicode width: liblouis is 2 and python isn't")
Exception: mismatch in Unicode width: liblouis is 2 and python isn't
jcsteh commented 6 years ago

@leonardder commented on Sep 4, 2018, 6:44 PM GMT+10:

  File "d:\Development\NVDA\source repo\source\louis\__init__.py", line 358, in <module>
    raise Exception("mismatch in Unicode width: liblouis is 2 and python isn't")

Sure, but that's because the bindings don't yet support both bit widths (regardless of Python bit width). Once that's fixed, this check can be removed from the bindings. That should be done as part of liblouis/liblouis#633.

LeonarddeR commented 6 years ago

That's already covered in https://github.com/liblouis/liblouis/pull/633 indeed.

LeonarddeR commented 5 years ago

Ugh, there's one major thing I didn't think about yet. When compiling liblouis with UCS-4, cursor position breaks in a major way. This is because NVDA provides the cursor position based on the internal python 2 unicode representation, which has two offsets for characters involving surrogates in UTF-16, such as 😊. A switch to Python 3 would automatically fix this I believe, since Python 3 unicode strings behave similar to 4 byte widechar strings (i.e. one 4 byte liblouis widechar corresponds with one offset in a python 3 string). We could work around this by working with an internal UTF-32 representation of the text, but that feels a bit hacky,

zstanecic commented 5 years ago

Hi,

As i know, this can break braille extender.

mailto:reply@reply.github.com @Andre9642/BrailleExtender

From: Leonard de Ruijter notifications@github.com Sent: Tuesday, December 4, 2018 9:10 AM To: nvaccess/nvda nvda@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: Re: [nvaccess/nvda] Compiling Liblouis with 32-bit Unicode support (#6695)

Ugh, there's one major thing I didn't think about yet. When compiling liblouis with UCS-4, cursor position breaks in a major way. This is because NVDA provides the cursor position based on the internal python 2 unicode representation, which has two offsets for characters involving surrogates in UTF-16, such as 😊. A switch to Python 3 would automatically fix this I believe, since Python 3 unicode strings behave similar to 4 byte widechars (i.e. one 4 byte liblouis widechar corresponds with one offset in a python 3 string). We could work around this by calculating the cursor position based on the UTF-32 representation of the text.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/nvaccess/nvda/issues/6695#issuecomment-444008438 , or mute the thread https://github.com/notifications/unsubscribe-auth/AKohkyY67JlukHZAO2lJ6taFiKj04jGCks5u1i2_gaJpZM4Lazbp . https://github.com/notifications/beacon/AKohk0Ib7jj4AInJzLHYlFqrSpdU3TBpks5u1i2_gaJpZM4Lazbp.gif

LeonarddeR commented 5 years ago

I've found a way to do this anyway. A pull request will be filed shortly.

Adriani90 commented 5 years ago

@leonardder Should this issue be closed now that #9544 has been merged?

LeonarddeR commented 5 years ago

Fixed in #9544