joshy / striprtf

Stripping rtf to plain old text
http://striprtf.dev
BSD 3-Clause "New" or "Revised" License
94 stars, 27 forks

rtf_to_text() converts RTF cp1252 russian text bad #50

Open svladimirs opened 11 months ago

svladimirs commented 11 months ago

striprtf 0.0.26

With a header such as {\rtf1\ansi\ansicpg1251 or {\rtf1\adeflang1025\ansi\ansicpg1251, rtf_to_text() converts cp1251 RTFs (Russian text) correctly.

With {\rtf1\adeflang1025\ansi\ansicpg1252 it does not: абвгдеёжзийклмнопрст -> àáâãäå¸æçèéêëìíîïðñò

Passing encoding=... does not help.

This helps: https://ru.stackoverflow.com/questions/1145225/Ошибка-обработки-файлов-rtf-на-python?ysclid=lqagyqz7x5798462943 or rtf_to_text(rtf.read()).encode('cp1252').decode('ansi')

Attachment: test-rus.zip
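
The garbling reported above can be reproduced without striprtf. The following is a minimal sketch, assuming the file's text bytes are really cp1251 while the header declares cp1252; the variable names are illustrative only.

```python
# Cyrillic sample taken from the report above.
original = "абвгдеёжзийклмнопрст"

# Assumption: the RTF stores these letters as cp1251 bytes ...
raw = original.encode("cp1251")

# ... but the header says \ansicpg1252, so a reader decodes them as cp1252.
garbled = raw.decode("cp1252")
print(garbled)  # àáâãäå¸æçèéêëìíîïðñò

# Undoing the wrong decode step recovers the original text.
restored = garbled.encode("cp1252").decode("cp1251")
print(restored == original)  # True
```

This is why the encode('cp1252') round trip helps: it turns the wrongly decoded characters back into the original bytes, which can then be decoded with the code page that was actually used.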

joshy commented 11 months ago

Hi, according to Wikipedia, Cyrillic RTF should be encoded in cp1251, not cp1252. If I change the RTF content to cp1251, it works fine. cp1252 is the Western encoding.

svladimirs commented 11 months ago

MS Word 2016 saves with 1251 (test-2016.zip), but newer versions released after 2016, such as MS Word 2021, save with 1252 (test-rus).

stevengj commented 11 months ago

If a file (whether it's RTF or any other encoding) lists the wrong encoding, you are going to get mojibake … I don't think there's anything striprtf can realistically do about buggy RTF files.

joshy commented 11 months ago

I created a small test case myself with Word 365, and it does indeed save the file with encoding 1252. I have no idea how Word determines the right encoding in this case. Some online RTF viewers (https://products.groupdocs.app/de/viewer/rtf, https://jumpshare.com/viewer/rtf) are also able to display the content correctly, and so is WordPad. The question is how do they figure out the right encoding?

svladimirs commented 11 months ago

Thanks. I did it like this:

decoded = rtf_to_text(rtf)
try:
    decoded = decoded.encode('cp1252').decode('ansi')
except:
    pass

joshy commented 11 months ago

@svladimirs: Glad you found a workaround. When I run your code, I get: LookupError: unknown encoding: ansi. How can this run?
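
A LookupError of this kind means the codec name is not registered in the running interpreter. The Python documentation lists mbcs and its ansi alias as Windows only, so one likely explanation is that the snippet was run on a different platform. The sketch below checks which names resolve locally; codec_available is a hypothetical helper, not part of striprtf.

```python
import codecs


def codec_available(name: str) -> bool:
    """Return True if this interpreter can look up the codec `name`."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


# cp1251 and cp1252 ship with every CPython build; mbcs/ansi depend on the platform.
for name in ("cp1251", "cp1252", "mbcs", "ansi"):
    print(name, codec_available(name))
```

Explicit code page names such as cp1251 keep the workaround portable, whereas ansi resolves to whatever the local Windows code page happens to be.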

stevengj commented 11 months ago

> The question is how do they figure out the right encoding?

Maybe they do charset detection?

joshy commented 11 months ago

> The question is how do they figure out the right encoding?

> Maybe they do charset detection?

I tried the chardet library and it told me with nearly 80% confidence that the encoding is ISO-8859-8 which is Hebrew. What I tried:

import chardet

y = 'àáâãäå¸æçèéêëìíîïðñò\n'.encode('cp1252')
chardet.detect(y)
# {'encoding': 'ISO-8859-8', 'confidence': 0.7950708952163513, 'language': 'Hebrew'}

svladimirs commented 11 months ago

@joshy, you're probably using Python < 3.6. See section 7.2.4.1, Text Encodings: https://docs.python.org/3.5/library/codecs.html https://docs.python.org/3.6/library/codecs.html Let's replace ansi with mbcs.

In the Stack Overflow answer linked above, the author used .encode('iso-8859-1').decode('cp1251'), but I tried to write universal code. I replaced 'iso-8859-1' with 'cp1252' because the function signature is def rtf_to_text(text, encoding="cp1252", errors="strict").

https://learn.microsoft.com/en-us/cpp/text/support-for-multibyte-character-sets-mbcss?view=msvc-170 "For platforms used in markets whose languages use large character sets, the best alternative to Unicode is MBCS". I thought ansi (mbcs) would be more versatile than cp1251.

What do you think about: def rtf_to_text(text, encoding="mbcs", errors="strict"). Will it work? Maybe the problem with 1251/1252 will go away?

joshy commented 11 months ago

@svladimirs As you can see I am using python 3.9.

[image: screenshot]

Regarding your proposals:

svladimirs commented 11 months ago

Well, mbcs won't work either... Then:

decoded = rtf_to_text(rtf)
try:
    decoded = decoded.encode('cp1252').decode('cp1251')
except:
    pass
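
The workaround above can be wrapped in a small function that catches only encoding errors instead of every exception. This is a sketch under the assumption that the text was wrongly decoded as cp1252 but is really cp1251; recode_if_possible and its parameter names are hypothetical and not part of striprtf.

```python
def recode_if_possible(text: str, declared: str = "cp1252", actual: str = "cp1251") -> str:
    """Re-encode with the declared code page and decode with the actual one.

    Returns the text unchanged if either step fails.
    """
    try:
        return text.encode(declared).decode(actual)
    except UnicodeError:
        # Raised when the text has characters outside `declared`,
        # or bytes that are undefined in `actual`.
        return text


# Intended use on the output of striprtf:
#   decoded = recode_if_possible(rtf_to_text(rtf))
print(recode_if_possible("àáâãä"))  # абвгд
```

Text that already contains Cyrillic letters cannot be encoded as cp1252, so it passes through unchanged, and plain ASCII is identical in both code pages.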

joshy commented 10 months ago

As a library, I can't do that; you as a user can. The reason is that the encoding specified in the RTF file is correct, and the library would convert the text to a wrong encoding.
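
The risk described here can be illustrated with genuinely Western text. In this sketch the sample string is made up for illustration: applying the same cp1252 to cp1251 round trip to a file whose declared encoding is right silently corrupts it.

```python
# Text from a file that really is cp1252, so its header is correct.
western = "café Müller"

# Applying the Cyrillic workaround anyway swaps accented letters for Cyrillic ones.
recoded = western.encode("cp1252").decode("cp1251")
print(recoded)  # cafй Mьller
```

No exception is raised in this case, so a library applying the conversion unconditionally could not tell a repaired file from a damaged one.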