joshy / striprtf

Stripping rtf to plain old text
http://striprtf.dev
BSD 3-Clause "New" or "Revised" License

encoding for Chinese characters #28

Closed yilu1015 closed 2 years ago

yilu1015 commented 2 years ago

Issue: Chinese characters not properly decoded.

Test file: test-with-chinese-characters.rtf.zip

Code:

from striprtf.striprtf import rtf_to_text

with open('test-with-chinese-characters.rtf') as document:
    content = rtf_to_text(document.read())
    print(content)

Output:

Ó¡Ë¢Çé¿ö·´Ó³£º
201-003-00155 (Multiple)

ÊÐÕþ¸®Çé¿ö·´Ó³£º
022-021-00768 (Multiple)

Expected:

印刷情况反映:
201-003-00155 (Multiple)

市政府情况反映:
022-021-00768 (Multiple)

joshy commented 2 years ago

Hi, the RTF file declares the wrong encoding; the correct one would be ansicpg936. I need to revert the changes made for this fix, otherwise all other encodings will not work.
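For anyone checking the diagnosis: the garbled output above looks like GBK (code page 936) bytes that were decoded with a Western code page. The sketch below is not part of striprtf. It assumes the wrong code page was cp1252, and `repair` is a hypothetical helper that only re-encodes the printed text to recover the original bytes before decoding them as cp936.

```python
# Mojibake lines copied from the "Output" section of this issue.
garbled = ["Ó¡Ë¢Çé¿ö·´Ó³£º", "ÊÐÕþ¸®Çé¿ö·´Ó³£º"]


def repair(text: str) -> str:
    """Hypothetical helper: undo a cp1252 decode and redo it as cp936 (GBK).

    Assumption: the text was produced by decoding GBK bytes as cp1252.
    """
    raw = text.encode("cp1252")   # recover the original bytes
    return raw.decode("cp936")    # decode them with the code page the file should declare


repaired = [repair(line) for line in garbled]
for line in repaired:
    print(line)
```

If the assumption holds, the two lines come back as the Chinese headings from the "Expected" section, which supports the point that the file's content is cp936 even though its header says otherwise.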