Decoding non-ASCII chars to UTF-8 is problematic

mzbik commented 1 year ago

The text of the string is below. The \'D0 char is incorrectly encoded for UTF-8. The problem is that the document contains \'hex and \unicode (\uN) chars in addition to plain ASCII. I'm not sure how to handle this correctly; the RTF may contain a sequence of \'hex escapes that describes some Unicode encoding instead of using the \unicode representation. MS Word somehow manages to process this correctly, probably by assuming that each \'hex is a single character and not part of a byte stream that could be, say, a UTF-16 representation. Wikipedia is pretty clear that RTF is a stream of ASCII >characters< and that the \'hex form is a way of specifying a >character< bigger than 127.

I could be wrong ;)

I think you might need to go to a high-precision intermediate representation before converting. Yes, it's another pass over the data.

{\rtf1\epicV10300\ansi\spltpgpar\jexpand\noxlattoyen \deff1\paperw12240\paperh15840\margl1800\margr1800\margt1440\margb1440\ftnbj {\fonttbl{\f1\fcharset0\fmodern SEGOE UI;}} \sectd \pard \plain\ltrch\fs22 Test note not with non-latin content: \'D0\u8216 \'91\'D0\u8221 \'94\'D0\u8220 \'93 }
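For illustration, here is a minimal sketch of the "each \'hex is a single character" reading applied to the text run in the sample above. It is not striprtf's implementation: `decode_run` is a hypothetical helper, and it assumes the code page is Windows-1252 (the sample declares \ansi but no \ansicpg) and that the RTF default of one fallback character after each \uN applies.

```python
import re

# Matches either a \'hh escape or a \uN escape (with its optional trailing space).
TOKEN = re.compile(r"\\'([0-9a-fA-F]{2})|\\u(-?\d+) ?")

def decode_run(run, codepage="cp1252", uc=1):
    """Hypothetical helper: decode \\'hh and \\uN escapes in one RTF text run.

    Assumptions: `codepage` is the document's ANSI code page and `uc` is the
    number of fallback characters that follow each \\uN (RTF default is 1).
    """
    out = []
    skip = 0  # fallback characters still to be discarded after a \uN
    pos = 0
    while pos < len(run):
        m = TOKEN.match(run, pos)
        if m is None:
            # Plain character: emit it unless it is a \uN fallback.
            if skip:
                skip -= 1
            else:
                out.append(run[pos])
            pos += 1
            continue
        pos = m.end()
        if m.group(1) is not None:
            # \'hh is one byte in the document code page, i.e. one character.
            if skip:
                skip -= 1
            else:
                byte = bytes.fromhex(m.group(1))
                out.append(byte.decode(codepage, errors="replace"))
        else:
            # \uN is a signed 16-bit value; negative values wrap around.
            n = int(m.group(2))
            if n < 0:
                n += 65536
            out.append(chr(n))
            skip = uc
    return "".join(out)

sample = r"\'D0\u8216 \'91\'D0\u8221 \'94\'D0\u8220 \'93"
decoded = decode_run(sample)
print(decoded)  # U+00D0 U+2018 U+00D0 U+201D U+00D0 U+201C
```

Under these assumptions every \'D0 becomes "Ð" and each \'91, \'94, \'93 is dropped as the fallback for the preceding \uN. One possible reading of the result: re-encoding it with `decoded.encode("cp1252").decode("utf-8")` yields "БДГ", which would suggest the program that wrote the RTF treated UTF-8 bytes as Windows-1252 text. That is an inference from the byte values, not something the sample itself states.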

joshy commented 1 year ago

Hmm, I just had a situation where the unicode representation was escaped with \E. So before passing the text to rtf_to_text, I replaced every \E with '' (text.replace("\\E", "")), which solved it for me. Can you share your example?

mzbik commented 1 year ago

The text is at the end of the issue report.

joshy commented 1 year ago

I have no Windows machine at hand, but on a Mac, TextEdit and striprtf (version 0.0.22) give the same result: Test note not with non-latin content: 'D0‘91'D0”94'D0“93

joshy / striprtf

Decoding non-ASCII chars to UTF-8 is problematic #36