joshy / striprtf

Stripping rtf to plain old text
http://striprtf.dev
BSD 3-Clause "New" or "Revised" License

Option to change encoding error parameter #21

Closed PandasBear123 closed 3 years ago

PandasBear123 commented 3 years ago

Issue

Certain Unicode characters cause 'rtf_to_text()' to raise the following 'UnicodeEncodeError': "UnicodeEncodeError: 'charmap' codec can't encode character '\u018f' in position 0: character maps to <undefined>".

This occurred with the following chars: ['\u018f', '\u2003', '\u2002', '\u2008', '\u202f', '\ufb02', '\u0422', '\u200a', '\ufb01', '\u0131']
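
The failure can be reproduced without striprtf, using only Python's built-in 'str.encode()'. The sketch below assumes the target encoding is 'cp1252'; the issue does not name it, but the 'charmap' wording in the message is what Python reports for code-page codecs of this kind. It tries each reported character in turn:

```python
# Characters reported in the issue.
chars = ['\u018f', '\u2003', '\u2002', '\u2008', '\u202f',
         '\ufb02', '\u0422', '\u200a', '\ufb01', '\u0131']

# Assumption: the target encoding is cp1252; the issue does not state it.
ENCODING = "cp1252"

failed = []
for ch in chars:
    try:
        ch.encode(ENCODING)  # errors="strict" is the default
    except UnicodeEncodeError as exc:
        failed.append(ch)
        print(f"U+{ord(ch):04X}: {exc}")
```

Under that assumption, none of the listed characters has a 'cp1252' mapping, so each one ends up in 'failed'.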

Fix

These exceptions can be caught; however, solving the issue that way is tedious, especially since the offending characters are mostly redundant and would be removed during data sanitisation anyway.

A quick solution is to add an optional parameter that is passed through to the existing 'errors' parameter of 'encode()'.

Errors can then be avoided with the 'ignore' option, and the rest of the script can run.
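
As a rough illustration of the proposal, the sketch below uses a hypothetical helper, 'encode_text()', that simply forwards an 'errors' argument to 'str.encode()'. The helper name and the 'cp1252' default are assumptions made for this example, not striprtf's actual code.

```python
def encode_text(text, encoding="cp1252", errors="strict"):
    """Hypothetical helper: forward `errors` to str.encode()."""
    return text.encode(encoding, errors)


sample = "caf\u00e9 \u018f"  # 'é' exists in cp1252, 'Ə' does not

# errors="ignore" silently drops characters the codec cannot represent.
ignored = encode_text(sample, errors="ignore")

# errors="replace" substitutes '?' so the loss stays visible.
replaced = encode_text(sample, errors="replace")

# The default, errors="strict", still raises as before.
try:
    encode_text(sample)
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

print(ignored)
print(replaced)
```

Because 'ignore' discards data without any trace, 'replace' may be the safer choice when it matters which characters were lost.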

joshy commented 3 years ago

Hi Kemi,

Thanks a lot for the pull request. I am considering merging the changes. Did you encounter the error in a real RTF document, and if so, could it be shared?

Thanks and with best regards, Joshy

PandasBear123 commented 3 years ago

Unfortunately, the files I was working on are confidential. Out of 140 RTF files, 103 converted flawlessly. The files are in different languages, so they contain a variety of alphabets. An example of a character that caused a failure is 'Ə'.

joshy commented 3 years ago

Hi Kemi, thanks for the PR.

Thanks, Joshy