Open svladimirs opened 11 months ago
Hi, according to Wikipedia, Cyrillic RTF should be encoded in cp1251, not cp1252. If I change the RTF content to cp1251 it works fine. cp1252 is the Western encoding.
MS Word 2016 (test-2016.zip) saves the file as cp1251, but MS Word 2021 (and the other versions released after 2016) saves it as cp1252 (test-rus).
If a file (whether it's RTF or any other encoding) lists the wrong encoding, you are going to get mojibake … I don't think there's anything striprtf can realistically do about buggy RTF files.
I created a small test case myself with Word 365, and indeed it saves the file with encoding 1252. I have no idea how Word finds the right encoding in this case. Some online RTF viewers (https://products.groupdocs.app/de/viewer/rtf, https://jumpshare.com/viewer/rtf) are also able to display the content correctly, and so is WordPad. The question is how do they figure out the right encoding?
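One possible explanation, which is an assumption and has not been checked against Word or these viewers, is the RTF font table: each font entry can carry a `\fcharsetN` control word, and `\fcharset204` is the Russian charset, which corresponds to cp1251. A minimal sketch of reading that hint, where the function name, the mapping subset, and the sample header are all made up for illustration:

```python
import re

# Subset of RTF \fcharsetN values and the Windows code pages they imply.
FCHARSET_TO_CODEPAGE = {
    0: "cp1252",    # ANSI / Western
    128: "cp932",   # Shift-JIS
    129: "cp949",   # Hangul
    134: "cp936",   # GB2312
    136: "cp950",   # Big5
    161: "cp1253",  # Greek
    162: "cp1254",  # Turkish
    177: "cp1255",  # Hebrew
    178: "cp1256",  # Arabic
    186: "cp1257",  # Baltic
    204: "cp1251",  # Russian
    222: "cp874",   # Thai
    238: "cp1250",  # Eastern European
}


def guess_codepages_from_fonts(rtf_source):
    """Return code pages implied by \\fcharsetN entries, in order of first appearance."""
    seen = []
    for match in re.finditer(r"\\fcharset(\d+)", rtf_source):
        codepage = FCHARSET_TO_CODEPAGE.get(int(match.group(1)))
        if codepage and codepage not in seen:
            seen.append(codepage)
    return seen


# Hypothetical header: declares cp1252 globally, but one font is Russian.
sample = (
    r"{\rtf1\ansi\ansicpg1252{\fonttbl"
    r"{\f0\fswiss\fcharset0 Arial;}{\f1\fswiss\fcharset204 Arial CYR;}}}"
)
print(guess_codepages_from_fonts(sample))
```

If a file declares `\ansicpg1252` but its text runs use a font with `\fcharset204`, decoding those runs as cp1251 would give the Cyrillic text. Whether test-rus actually contains such a font entry would need to be checked in the file itself.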
Thanks. I did it like this:
decoded = rtf_to_text(rtf)
try:
    decoded = decoded.encode('cp1252').decode('ansi')
except:
    pass
@svladimirs: Glad you got a workaround. When I run your code I get: LookupError: unknown encoding: ansi. How does this run for you?
The question is how do they figure out the right encoding?
Maybe they do charset detection?
I tried the chardet library and it told me with nearly 80% confidence that the encoding is ISO-8859-8 which is Hebrew. What I tried:
y = 'àáâãäå¸æçèéêëìíîïðñò\n'.encode('cp1252')
import chardet
chardet.detect(y)
>>>{'encoding': 'ISO-8859-8', 'confidence': 0.7950708952163513, 'language': 'Hebrew'}
@joshy, you're probably using Python < 3.6. See section 7.2.4.1, Text Encodings: https://docs.python.org/3.5/library/codecs.html https://docs.python.org/3.6/library/codecs.html Let's replace ansi with mbcs.
In that Stack Overflow link the author used .encode('iso-8859-1').decode('cp1251'), but I tried to write universal code. I replaced 'iso-8859-1' with 'cp1252' because the signature is def rtf_to_text(text, encoding="cp1252", errors="strict").
https://learn.microsoft.com/en-us/cpp/text/support-for-multibyte-character-sets-mbcss?view=msvc-170 "For platforms used in markets whose languages use large character sets, the best alternative to Unicode is MBCS". I thought ansi (mbcs) would be more versatile than cp1251.
What do you think about: def rtf_to_text(text, encoding="mbcs", errors="strict"). Will it work? Maybe the problem with 1251/1252 will go away?
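On the `ansi` / `mbcs` point: `mbcs` is a Windows-only codec that decodes with the machine's current ANSI code page, so it yields cp1251 only on a system configured for Cyrillic, and on other platforms looking it up raises `LookupError`. A small sketch for checking what is available on a given machine, where `ansi_codec_name` is a hypothetical helper and not part of striprtf:

```python
import codecs
import locale


def ansi_codec_name():
    """Return a codec name for the system ANSI code page, or the locale default elsewhere."""
    for name in ("mbcs", "ansi"):
        try:
            # Succeeds only where Python ships the Windows ANSI codec.
            return codecs.lookup(name).name
        except LookupError:
            continue
    # Fallback (an assumption of this sketch): use the locale's preferred encoding.
    return codecs.lookup(locale.getpreferredencoding(False)).name


print(ansi_codec_name())
```

Because the result depends on the machine, code that relies on it is not portable. Naming the target code page explicitly (for example 'cp1251') behaves the same everywhere.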
@svladimirs As you can see, I am using Python 3.9.
Regarding your proposal:
def rtf_to_text(text, encoding="mbcs", errors="strict")
The encoding argument is only used as a default. If the file itself declares an encoding, as files from newer versions of Word do, the encoding given in the RTF file is used.
Well, mbcs won't work either... Then:
decoded = rtf_to_text(rtf)
try:
    decoded = decoded.encode('cp1252').decode('cp1251')
except:
    pass
As a library, striprtf can't do that; you as a user can. The reason is that in the general case the encoding specified in the RTF file is correct, and the library would then convert the text to a wrong encoding.
striprtf 0.0.26
{\rtf1\ansi\ansicpg1251
{\rtf1\adeflang1025\ansi\ansicpg1251
rtf_to_text() converts RTF files declared as cp1251 correctly (Russian text).
{\rtf1\adeflang1025\ansi\ansicpg1252
But not files declared as cp1252: абвгдеёжзийклмнопрст -> àáâãäå¸æçèéêëìíîïðñò
Passing encoding=... does not help.
This helps: https://ru.stackoverflow.com/questions/1145225/Ошибка-обработки-файлов-rtf-на-python?ysclid=lqagyqz7x5798462943
or
rtf_to_text(rtf.read()).encode('cp1252').decode('ansi')
test-rus.zip
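The garbled output is what you get when cp1251 bytes are decoded as cp1252, which is why re-encoding as cp1252 and decoding as cp1251 restores the text. A minimal sketch of that round trip using the sample string from this issue, where `fix_cyrillic_mojibake` is a hypothetical user-side helper and not part of striprtf:

```python
def fix_cyrillic_mojibake(text, wrong="cp1252", right="cp1251"):
    """Undo text that was decoded as `wrong` although its bytes were `right`."""
    try:
        return text.encode(wrong).decode(right)
    except UnicodeError:
        # Not representable in `wrong` (e.g. already Cyrillic) or not valid
        # in `right`: leave the text unchanged.
        return text


original = "абвгдеёжзийклмнопрст"
# Reproduce the problem: cp1251 bytes read as cp1252.
garbled = original.encode("cp1251").decode("cp1252")
print(garbled)
print(fix_cyrillic_mojibake(garbled))
```

This should only be applied when you already know the document is Cyrillic. Genuine Western text such as "café" also survives the cp1252 encode step and would be turned into Cyrillic letters by mistake.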