Closed by PandasBear123 3 years ago
Hi Kemi,
thanks a lot for the pull request. I am considering merging the changes. Did you encounter the error in a real RTF document, and could it perhaps be shared?
Thanks and with best regards, Joshy
Unfortunately, the files I was working on are confidential. Out of 140 RTF files, 103 converted flawlessly. The files are in different languages, so they contain a variety of alphabets. One example of a character that caused a failure is 'Ə'.
Hi Kemi, thanks for the PR.
Thanks, Joshy
Issue
Certain Unicode characters cause 'rtf_to_text()' to raise the following 'UnicodeEncodeError':
UnicodeEncodeError: 'charmap' codec can't encode character '\u018f' in position 0: character maps to <undefined>
This occurred with the following chars: ['\u018f', '\u2003', '\u2002', '\u2008', '\u202f', '\ufb02', '\u0422', '\u200a', '\ufb01', '\u0131']
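The failure can be reproduced outside the library with a short check. This is a minimal sketch that assumes the codec involved is 'cp1252', one of Python's charmap codecs. The report does not name the codec, and 'failing_chars' is a hypothetical helper written only for this illustration.

```python
# Characters listed in the report above.
PROBLEM_CHARS = ['\u018f', '\u2003', '\u2002', '\u2008', '\u202f',
                 '\ufb02', '\u0422', '\u200a', '\ufb01', '\u0131']


def failing_chars(chars, encoding="cp1252"):
    """Return the characters that cannot be encoded with `encoding`.

    Hypothetical helper; 'cp1252' is an assumed codec, not one named
    in the report.
    """
    failed = []
    for ch in chars:
        try:
            ch.encode(encoding)  # errors='strict' is the default
        except UnicodeEncodeError:
            failed.append(ch)
    return failed


if __name__ == "__main__":
    # Print the code point of every character that fails to encode.
    for ch in failing_chars(PROBLEM_CHARS):
        print(f"U+{ord(ch):04X}")
```

Under that assumption, none of the listed characters has a mapping in the codec, so each one raises the error in strict mode.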
Fix
These exceptions can be caught; however, solving the issue this way is tedious, especially since the offending characters are mostly redundant and would be removed during data sanitisation anyway.
A quick solution is to add an optional parameter that is passed through to the existing 'errors' parameter of 'encode()'.
Errors can then be avoided with the 'ignore' option, and the rest of the script can run.
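The idea can be sketched as follows. This is an illustration only, not the code in this PR: 'encode_text' is a hypothetical stand-in for the encoding step, and 'cp1252' is an assumed codec. The only real API used is the 'errors' argument of 'str.encode()'.

```python
def encode_text(text, encoding="cp1252", errors="strict"):
    """Encode `text`, forwarding `errors` to str.encode().

    Hypothetical helper. The default 'strict' keeps the current
    behaviour, so existing callers are unaffected.
    """
    return text.encode(encoding, errors)


# 'é' exists in cp1252; the em space and 'Ə' do not.
sample = "caf\u00e9\u2003\u018f"

try:
    encode_text(sample)  # strict mode raises on the first bad character
except UnicodeEncodeError as exc:
    print("strict failed:", exc.reason)

# 'ignore' silently drops characters that cannot be encoded.
print(encode_text(sample, errors="ignore"))

# 'replace' substitutes '?' for each character that cannot be encoded.
print(encode_text(sample, errors="replace"))
```

Because 'ignore' discards data without warning, keeping 'strict' as the default leaves the choice to the caller.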