liblouis / liblouisutdml

An open-source library providing complete braille transcription services for xml, html and text documents
http://liblouis.io
GNU General Public License v3.0

Incorrectly encoding text when backFormat is text #68

Open mwhapples opened 4 years ago

mwhapples commented 4 years ago

When back-translating with file2brl, if the configuration has backFormat set to text and the back-translated text contains Unicode characters outside the ASCII range, those characters are encoded incorrectly. As an example, using en-ueb-g2.ctb as the translation table, try back-translating a word containing an apostrophe (e.g. I'M, CAN'T). The apostrophe is written out as the single byte 0x19. With backFormat set to html, the same apostrophe appears to be back-translated to the Unicode character \u2019 (right single quotation mark). I therefore suspect that file2brl is simply dropping the high byte of each Unicode character when backFormat is set to text.
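The arithmetic behind that suspicion can be checked with a small sketch. This is not liblouisutdml code: the function names below are hypothetical, and the truncation function only models the suspected behaviour (keeping the low byte of each code point). UTF-8 is shown purely as an assumed example of what a correct text encoding of U+2019 would look like.

```python
def truncate_to_low_byte(text: str) -> bytes:
    """Hypothetical model of the suspected bug: keep only the low byte
    of each code point and discard everything above it."""
    return bytes(ord(ch) & 0xFF for ch in text)


def encode_utf8(text: str) -> bytes:
    """Encode the same text as UTF-8 (assumed here as the intended output)."""
    return text.encode("utf-8")


# "I'M" with the apostrophe as U+2019, as seen with backFormat set to html.
word = "I\u2019M"

truncated = truncate_to_low_byte(word)
correct = encode_utf8(word)

print(truncated.hex(" "))  # 49 19 4d
print(correct.hex(" "))    # 49 e2 80 99 4d
```

The low byte of 0x2019 is 0x19, which matches the byte observed in the text output and is consistent with the high byte being dropped.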