joshy / striprtf

Stripping rtf to plain old text
http://striprtf.dev
BSD 3-Clause "New" or "Revised" License

rtf_to_text ignores the errors parameter #46

Closed by powo 1 year ago

powo commented 1 year ago

The errors= parameter of rtf_to_text is documented in the docstring and mentioned in several issues (#34, #27), but it is completely ignored: it is never passed to .decode(..), which leads to UnicodeDecodeErrors.

stevengj commented 1 year ago

Do you have an example .rtf file that illustrates your problem?

powo commented 1 year ago

Here is an example:

>>> striprtf.rtf_to_text(r"{\rtf1\ansi\ansicpg0 T\'e4st}", encoding="utf-8", errors="replace")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/powo/Sync/dev/bat/.venv/lib/python3.11/site-packages/striprtf/striprtf.py", line 136, in rtf_to_text
    out += bytes.fromhex(hexes).decode(encoding=encoding)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0: unexpected end of data

The expected behavior is that errors="replace" suppresses the exception and substitutes the replacement character (U+FFFD) for the undecodable byte, like this:

>>> b'T\xe4st'.decode("utf-8", errors="replace")
'T�st'
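For illustration, here is a minimal sketch of what forwarding the parameter could look like. It is modeled on the decode call shown in the traceback, not on the actual striprtf source; `decode_hex_run` is a hypothetical helper name and its signature is an assumption.

```python
def decode_hex_run(hexes: str, encoding: str = "utf-8", errors: str = "strict") -> str:
    """Hypothetical helper: decode a run of hex digits collected from RTF \\'hh escapes.

    Unlike the call in the traceback, `errors` is forwarded to bytes.decode().
    """
    return bytes.fromhex(hexes).decode(encoding=encoding, errors=errors)


# With errors="replace", the lone byte 0xe4 becomes U+FFFD instead of raising.
print(repr(decode_hex_run("e4", encoding="utf-8", errors="replace")))

# With the default errors="strict", the same input still raises UnicodeDecodeError.
try:
    decode_hex_run("e4", encoding="utf-8")
except UnicodeDecodeError as exc:
    print("strict mode raised:", exc.reason)
```

Under this assumption, callers who pass errors="replace" or errors="ignore" get lossy output rather than an exception, while the default strict behavior is unchanged.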