UTF-16 character issues.

liblouis / liblouis-java

Java bindings for liblouis

GNU Lesser General Public License v3.0


UTF-16 character issues. #20

Closed kalaspuffar closed 3 years ago

kalaspuffar commented 3 years ago

Hi @bertfrees

We found an issue with one book when using the dotify library. It showed up when we tried to translate an English book containing an alpha character. In UTF-16 this character is stored as a surrogate pair, so it is a single code point but counts as multiple chars when you ask for the string length.
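To make the mismatch concrete, here is a minimal Java sketch that uses only the standard library and the character quoted later in this thread (\uD835\uDEFC). The class and method names are made up for illustration and are not part of liblouis-java or dotify.

```java
class SurrogateLengthDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String alpha = "\uD835\uDEFC"; // one code point, stored as a surrogate pair
        System.out.println("chars: " + charCount(alpha));
        System.out.println("code points: " + codePointCount(alpha));
    }
}
```

Any code that sizes an array by one of these counts and indexes it by the other will go out of step as soon as such a character appears.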

I will submit a test case that showcases the issue.

https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF

Best regards Daniel

AlexanderHugestrand commented 3 years ago

More specifically, in this case we had two 16-bit values that encoded a single character and codepoint: \uD835\uDEFC

By following the instructions on the Wikipedia page, I get that the values encode:

0xD835 - 0xD800 = 0x0035 = 53 in decimal
0xDEFC - 0xDC00 = 0x02FC = 764 in decimal

And the codepoint is: 2^16 + 53 * 2^10 + 764 = 120 572 = 0x1D6FC
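The same arithmetic can be checked in Java. This is a small sketch with a hypothetical helper name (`decode`); it compares the manual calculation against the standard `Character.toCodePoint` method.

```java
class SurrogateDecodeDemo {
    // Manually combine a high and low surrogate into a code point:
    // 2^16 + (high - 0xD800) * 2^10 + (low - 0xDC00)
    static int decode(char high, char low) {
        int highBits = high - 0xD800;
        int lowBits = low - 0xDC00;
        return 0x10000 + (highBits << 10) + lowBits;
    }

    public static void main(String[] args) {
        char high = '\uD835';
        char low = '\uDEFC';
        int manual = decode(high, low);
        int builtIn = Character.toCodePoint(high, low);
        System.out.println(String.format("manual:   0x%X", manual));
        System.out.println(String.format("built-in: 0x%X", builtIn));
    }
}
```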

https://unicode-table.com/en/#1D6FC

bertfrees commented 3 years ago

Thanks.

Actually, liblouis-java expects the length of the "characterAttributes" argument to equal the length of the Java string (its char array), not the number of code points. But I have now found that I was handling this incorrectly: the "typeform" and "characterAttributes" arguments simply did not work when the input contained Unicode characters above U+FFFF.
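For callers who keep one attribute per code point, a rough sketch of widening that array to one entry per Java char might look like the following. This is not the fix made in liblouis-java and does not call its API; the helper name `expandToChars` and the use of `int[]` for the attributes are assumptions made purely for illustration.

```java
class AttributeExpansionDemo {
    // Hypothetical helper: repeat each per-code-point attribute once for
    // every UTF-16 char that the code point occupies in the string.
    static int[] expandToChars(String text, int[] perCodePoint) {
        int codePoints = text.codePointCount(0, text.length());
        if (perCodePoint.length != codePoints) {
            throw new IllegalArgumentException(
                "expected " + codePoints + " attributes, got " + perCodePoint.length);
        }
        int[] perChar = new int[text.length()];
        int charIndex = 0;
        int cpIndex = 0;
        while (charIndex < text.length()) {
            int cp = text.codePointAt(charIndex);
            int width = Character.charCount(cp); // 1, or 2 for a surrogate pair
            for (int k = 0; k < width; k++) {
                perChar[charIndex + k] = perCodePoint[cpIndex];
            }
            charIndex += width;
            cpIndex++;
        }
        return perChar;
    }

    public static void main(String[] args) {
        int[] expanded = expandToChars("a\uD835\uDEFCb", new int[] {1, 2, 3});
        System.out.println(java.util.Arrays.toString(expanded));
    }
}
```

The resulting array has the same length as `text.length()`, which is the length the comment above says liblouis-java expects.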

Will be fixed in the next release.

bertfrees commented 3 years ago

Fixed by commits aa08131 and 4b9cc74.