mzbik closed 1 year ago
Hmm, I just had the situation where the Unicode representation was escaped with \E. So before passing it to rtf_to_text I did a replacement of all \E with '' (text.replace("\\E", "")), which solved it for me. Can you share your example?
The text is at the end of the issue report.
On Dec 27, 2022, at 06:44, Joshy Cyriac @.***> wrote:
Hmm, I just had the situation where the Unicode representation was escaped with \E. So before passing it to rtf_to_text I did a replacement of all \E with '' (text.replace("\\E", "")) which solved it for me. Can you share your example?
I have no Windows machine at hand, but on Mac TextEdit and striprtf (version 0.0.22) give the same result: Test note not with non-latin content: 'D0‘91'D0”94'D0“93
The text of the string is below. The \'D0 char is incorrectly encoded for UTF-8. The problem is that there are \'hex and \unicode chars in addition to straight ASCII. I'm not sure how to handle this correctly; the RTF may contain a sequence of \'hex escapes that describes some Unicode encoding, as opposed to using the \unicode representation. MS Word somehow manages to process this correctly, probably by assuming that each \'hex is a single character and not part of some byte stream that could be a UTF-16 representation. The Wikipedia article is pretty clear that RTF is a stream of ASCII >characters< and that the \'hex form is a way of specifying a >character< bigger than 127.
I could be wrong ;)
I think you might need to go to a high-precision intermediate representation before converting. Yes, it's another pass over the data.
{\rtf1\epicV10300\ansi\spltpgpar\jexpand\noxlattoyen \deff1\paperw12240\paperh15840\margl1800\margr1800\margt1440\margb1440\ftnbj {\fonttbl{\f1\fcharset0\fmodern SEGOE UI;}} \sectd \pard \plain\ltrch\fs22 Test note not with non-latin content: \'D0\u8216 \'91\'D0\u8221 \'94\'D0\u8220 \'93 }
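To make the two-pass idea concrete, here is a minimal sketch. It is not how striprtf works internally, and it only handles the escaped fragment at the end of the RTF above, not a full document. The names tokenize and resolve are made up for illustration. It assumes that \ansi with no explicit code page means Windows-1252, and that \uc is at its default of 1, so each \uN is followed by one fallback character that a Unicode-aware reader skips.

```python
import re

# Fragment copied from the RTF body in this issue.
SAMPLE = r"\'D0\u8216 \'91\'D0\u8221 \'94\'D0\u8220 \'93"

# One token per match: a \'hh byte escape, a \uN escape (plus its optional
# delimiter space), or any other single character.
TOKEN = re.compile(r"\\'([0-9A-Fa-f]{2})|\\u(-?\d+) ?|(.)", re.S)


def tokenize(body):
    """First pass: build an intermediate list of (kind, value) tokens."""
    tokens = []
    for m in TOKEN.finditer(body):
        if m.group(1) is not None:
            tokens.append(("byte", int(m.group(1), 16)))
        elif m.group(2) is not None:
            # \uN is a signed 16-bit value; negative numbers wrap around.
            tokens.append(("uni", int(m.group(2)) % 0x10000))
        else:
            tokens.append(("char", m.group(3)))
    return tokens


def resolve(tokens, codepage="cp1252", uc=1):
    """Second pass: turn tokens into text, skipping \\uN fallback characters."""
    out = []
    skip = 0
    for kind, value in tokens:
        if skip:
            skip -= 1  # this token is the fallback for the preceding \uN
            continue
        if kind == "byte":
            # Assumption: each \'hh is one character in the document code page.
            out.append(bytes([value]).decode(codepage, errors="replace"))
        elif kind == "uni":
            out.append(chr(value))
            skip = uc
        else:
            out.append(value)
    return "".join(out)


text = resolve(tokenize(SAMPLE))
print(text)                                  # Ð‘Ð”Ð“
print(text.encode("cp1252").decode("utf-8"))  # БДГ
```

Under those assumptions the fragment resolves to Ð‘Ð”Ð“, which is what the UTF-8 bytes D0 91 D0 94 D0 93 look like when read as Windows-1252. Re-encoding that string as cp1252 and decoding it as UTF-8 gives БДГ. That suggests, but does not prove, that the program that wrote the file treated UTF-8 bytes as ANSI characters, so the last step is a repair heuristic for this file rather than something a general RTF reader should apply.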