joshy / striprtf

Stripping rtf to plain old text
http://striprtf.dev
BSD 3-Clause "New" or "Revised" License

suspicious behavior #37

Closed · nikita-petrashen closed this issue 1 year ago

nikita-petrashen commented 1 year ago

Hey!

I was trying to decode an RTF file and rtf_to_text threw an error "Cannot decode a symbol \u0417 blah-blah". I tried passing several different encodings to rtf_to_text, but it didn't work. Then I looked into the source code and found this line. I uncommented it and tried passing "blahblah" as the encoding, expecting it to print something, but it didn't.

Then I started to suspect that this or this line actually overrides the encoding argument that I pass to the function. After removing those two lines and setting the correct encoding, everything went smoothly.

I don't know what's going on there, but you should probably look into it, as this feels wrong :).

nikita-petrashen commented 1 year ago

The text was simple Russian text with no special characters other than newlines and tabs. The cp1251 encoding worked fine, but utf-8 didn't.
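A minimal illustration of the mismatch (this snippet is illustrative, not from the original report): Cyrillic text encoded as cp1251 bytes is not valid UTF-8, so decoding those bytes with the wrong codec raises exactly this kind of error.

```python
# Cyrillic text round-trips through cp1251, but the resulting bytes
# are not valid UTF-8, so decoding with the wrong codec fails.
data = "Привет".encode("cp1251")   # b'\xcf\xf0\xe8\xe2\xe5\xf2'
print(data.decode("cp1251"))       # Привет

try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print("utf-8 decode failed:", e)
```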

joshy commented 1 year ago

If the text worked with cp1251 then you can close the issue, right?

nikita-petrashen commented 1 year ago

It worked only after I removed the lines that I mentioned.

joshy commented 1 year ago

Can you please send me the sample RTF file?

plessbd commented 1 year ago

I am having the "same" issue:

txt_str = rtf_to_text(rtf_str, encoding='utf-8', errors="strict")

My character is the combined less-than-or-equal sign ("≤").

The most basic RTF that reproduces it:

{\rtf1\ansi\ansicpg1252\cocoartf2709
\cocoatextscaling0\cocoaplatform0{\fonttbl\f0\fswiss\fcharset0 ArialMT;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww11680\viewh15320\viewkind0
\deftab1134
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\tx9360\tx10080\pardeftab1134\partightenfactor0

\f0\fs18 \cf0 \uc0\u8804 \
\uc0\u8804 }

plessbd commented 1 year ago

So my issue might not be 100% related, but to get "around it" I forced the string to be a UTF-8 string all the time. You can see my change here: https://github.com/plessbd/striprtf/compare/master...handle-some-dirta. It "works for me", but I have not had the time to fully test it.

It seems the RTF files I have (which I have no ability to "fix" at the source) contain some extra characters that, according to https://www.ee.ucl.ac.uk/~mflanaga/java/HTMLandASCIItableWin.html, should work, but they don't show up in https://github.com/python/cpython/blob/main/Lib/encodings/cp1252.py. Things like the less-than-or-equal sign.

I am guessing the system that created these files has some other things running that I do not know about.

stevengj commented 1 year ago

Looking at the code, it seems that if the encoding is X, you first encode all of the output to X and then decode it to Unicode at the end.

It seems like this will inherently cause problems any time the text contains Unicode characters that don't exist in encoding X.

For example, in @plessbd's example above, the RTF encoding is \ansicpg1252, i.e. windows-1252, but the document contains \u8804, which is the Unicode character U+2264 "≤" (8804 = 0x2264). Unfortunately, Windows-1252 does not contain this character, so your encode/decode algorithm can't represent it in the intermediate byte array.
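A quick sketch of that failure mode using Python's built-in codecs (illustrative, not striprtf's code): U+2264 has no slot in the cp1252 table, so the intermediate encode step must either raise or lose the character.

```python
# "≤" (U+2264) exists in Unicode but not in Windows-1252, so forcing
# the output through that codepage cannot preserve it.
text = "\u2264"

try:
    text.encode("cp1252")                       # strict mode: raises
except UnicodeEncodeError as e:
    print("cannot encode:", e)

print(text.encode("cp1252", errors="replace"))  # lossy fallback: b'?'
```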

stevengj commented 1 year ago

to get "around it" I forced the string to be a utf8 string all the time.

This is the right idea, I think — you don't want to try to express arbitrary Unicode characters in the codepage encoding, and I used the same basic principle in the Julia port. However, I think your implementation is incorrect. You did:

out += bytes.fromhex(_hex).decode(encoding=encoding).encode('utf8').decode()

for a hexadecimal directive \'xx. This assumes that you can convert the codepage to Unicode/UTF-8 one byte at a time, which will fail for multi-byte East Asian code pages. For multi-byte codepages, you can only decode a character after several bytes have been received, whereas _hex is only a single byte.

For example, I think it will fail for issue_11.rtf, which is in Windows-936 (a multi-byte encoding).
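The multi-byte pitfall is easy to demonstrate with Python's cp936 (GBK) codec (again an illustrative snippet, not library code): a lone lead byte is an incomplete sequence and cannot be decoded on its own.

```python
# "中" (U+4E2D) is two bytes in cp936 (GBK); the bytes only decode
# together, never one at a time.
data = "中".encode("cp936")
print(data)                  # b'\xd6\xd0'
print(data.decode("cp936"))  # 中

try:
    bytes([data[0]]).decode("cp936")   # lead byte alone is incomplete
except UnicodeDecodeError as e:
    print("single-byte decode failed:", e)
```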

plessbd commented 1 year ago

Yep, that is exactly what I was running into when I got outside of "my" files. In the meantime I outsourced to pandoc; I am still working on trying to fix this...

joshy commented 1 year ago

Can someone provide a PR?

stevengj commented 1 year ago

I think @plessbd's code will work if they replace fromhex(_hex) with fromhex(hexes), where:

  1. hexes is the concatenation of all the preceding consecutive hex \'xx values.
  2. you only call out += bytes.fromhex(hexes).decode(encoding=encoding).encode('utf8').decode() when the first non-hex value is encountered, at which point you "flush the cache" of hexes, decoding them all together in the codepage and then re-encoding to UTF-8.
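A minimal sketch of that buffering scheme. The function name decode_hex_runs and the token shapes are hypothetical, not striprtf's actual internals: consecutive \'xx bytes accumulate until a non-hex token forces a flush, so multi-byte codepages like cp936 decode correctly.

```python
def decode_hex_runs(tokens, encoding):
    """Buffer consecutive \\'xx hex escapes and decode them together.

    `tokens` is a hypothetical pre-split stream where hex escapes
    appear as "\\'xx" strings and everything else is plain text.
    """
    out = []
    hexes = bytearray()
    for tok in tokens:
        if tok.startswith("\\'"):        # one raw codepage byte
            hexes.append(int(tok[2:], 16))
        else:
            if hexes:                    # flush: decode the whole run at once
                out.append(hexes.decode(encoding))
                hexes.clear()
            out.append(tok)
    if hexes:                            # trailing run at end of input
        out.append(hexes.decode(encoding))
    return "".join(out)

# Two cp936 bytes forming one character decode correctly as a unit:
print(decode_hex_runs(["\\'d6", "\\'d0", "!"], "cp936"))
```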

plessbd commented 1 year ago

@joshy created #40, all tests pass:

=== test session starts ===
platform darwin -- Python 3.11.2, pytest-7.3.0, pluggy-1.0.0
collected 29 items

tests/test_ansicpg1250.py .                                                                                                  [  3%]
tests/test_bytes.py .                                                                                                        [  6%]
tests/test_calcium_score.py .                                                                                                [ 10%]
tests/test_encoding.py .                                                                                                     [ 13%]
tests/test_french.py .                                                                                                       [ 17%]
tests/test_from_strings.py ..                                                                                                [ 24%]
tests/test_hello.py .                                                                                                        [ 27%]
tests/test_hyperlinks.py ...                                                                                                 [ 37%]
tests/test_issue_11.py .                                                                                                     [ 41%]
tests/test_issue_15.py .                                                                                                     [ 44%]
tests/test_issue_20.py .                                                                                                     [ 48%]
tests/test_issue_28.py .                                                                                                     [ 51%]
tests/test_issue_29.py ..                                                                                                    [ 58%]
tests/test_issue_38.py ..                                                                                                    [ 65%]
tests/test_line_breaks_google_docs.py .                                                                                      [ 68%]
tests/test_line_breaks_textedit_mac.py .                                                                                     [ 72%]
tests/test_nested_table.py .                                                                                                 [ 75%]
tests/test_nutridoc.py .                                                                                                     [ 79%]
tests/test_sample_3.py .                                                                                                     [ 82%]
tests/test_simple.py ..                                                                                                      [ 89%]
tests/test_speiseplan.py .                                                                                                   [ 93%]
tests/test_tx_rtf32.py .                                                                                                     [ 96%]
tests/test_unicode.py .                                                                                                      [100%]
=== 29 passed in 0.50s ===

Thank you @stevengj for the tip about handling the multiple hex values. Trying to work on this after working all day and wrap my head around it was not going well, which is why I stopped and switched to pandoc for my use case. Hopefully this will help @nikita-petrashen as well.