Closed nikita-petrashen closed 1 year ago
The text was a simple text in Russian with no special characters other than newlines and tabs. cp1251 encoding worked fine, but utf-8 didn't.
If the text worked with cp1251 then you can close the issue, right?
It worked only after I removed the lines that I mentioned.
Can you please send me the sample RTF file?
I am having the "same" issue
txt_str = rtf_to_text(rtf_str, encoding='utf-8', errors="strict")
My character is a combined "<=" symbol (≤).
The most basic RTF I can create:
{\rtf1\ansi\ansicpg1252\cocoartf2709
\cocoatextscaling0\cocoaplatform0{\fonttbl\f0\fswiss\fcharset0 ArialMT;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww11680\viewh15320\viewkind0
\deftab1134
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\tx9360\tx10080\pardeftab1134\partightenfactor0
\f0\fs18 \cf0 \uc0\u8804 \
\uc0\u8804 }
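For reference, the `\uN` control word carries a decimal (signed 16-bit) Unicode code point. A quick check in plain Python (independent of striprtf) shows what `\u8804` is:

```python
# \u8804 in the sample above is just decimal 8804 = U+2264.
# RTF stores the code point as a signed 16-bit number, so negative
# values wrap around 2**16.
n = 8804
ch = chr(n if n >= 0 else n + 65536)
print(ch, hex(ord(ch)))  # ≤ 0x2264
```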
So my issue might not be 100% related, but to get around it I forced the string to be a UTF-8 string all the time. You can see my change here: https://github.com/plessbd/striprtf/compare/master...handle-some-dirta. It "works for me", but I have not had the time to fully test it. It seems the RTFs I have (which I have no ability to "fix" at the source) contain some extra characters that, according to https://www.ee.ucl.ac.uk/~mflanaga/java/HTMLandASCIItableWin.html, should work, but they don't show up in https://github.com/python/cpython/blob/main/Lib/encodings/cp1252.py. Things like less-than-plus-equal-to (≤).
I am guessing the system that created this has some other things running that I do not know about.
Looking at the code, it seems like if the encoding is X, then you first encode all of the output to X and then decode to Unicode at the end.
It seems like this will inherently cause problems any time the text contains Unicode characters that don't exist in encoding X.
For example, in @plessbd's example above, the RTF encoding is \ansicpg1252, i.e. windows-1252, but the document contains \u8804, which is the Unicode character U+2264 "≤" (8804 = 0x2264). Unfortunately, Windows-1252 does not contain this character, so your encode/decode algorithm can't represent it in the intermediate byte array.
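A quick check in plain Python (independent of striprtf) confirms that Windows-1252 has no slot for this character, which is presumably exactly where the intermediate encode step blows up:

```python
# U+2264 "≤" cannot be encoded in Windows-1252, so forcing all output
# through the codepage must fail for this document:
try:
    "\u2264".encode("cp1252")
except UnicodeEncodeError as e:
    print(e)  # 'charmap' codec can't encode character '\u2264' ...
```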
> to get "around it" I forced the string to be a utf8 string all the time.
This is the right idea, I think — you don't want to try to express arbitrary Unicode characters in the codepage encoding, and I used the same basic principle in the Julia port. However, I think your implementation is incorrect. You did:
out += bytes.fromhex(_hex).decode(encoding=encoding).encode('utf8').decode()
for a hexadecimal directive \'xx. This assumes that you can convert the codepage to Unicode/UTF-8 one byte at a time, which will fail for multi-byte East Asian code pages. For multi-byte codepages, you can only decode characters after several bytes have been received, whereas _hex is only a single byte.
For example, I think it will fail for issue_11.rtf, which is in Windows-936 (a multi-byte encoding).
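The failure mode is easy to reproduce with the standard cp936 codec (an illustrative character, not bytes taken from issue_11.rtf):

```python
# "中" is two bytes in cp936/GBK; the first byte alone is an incomplete
# sequence, so byte-at-a-time decoding fails while decoding the full
# buffered pair succeeds:
data = "中".encode("cp936")   # b'\xd6\xd0'
try:
    data[:1].decode("cp936")
except UnicodeDecodeError:
    print("single byte fails")
print(data.decode("cp936"))   # 中
```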
Yep, that is exactly what I was running into when I got outside of "my" files. In the meantime I outsourced to pandoc; I am still working on trying to fix this...
Can someone provide a PR?
I think @plessbd's code will work if they replace fromhex(_hex) with fromhex(hexes), where hexes is the concatenation of all the previous consecutive hex \'xx values:

out += bytes.fromhex(hexes).decode(encoding=encoding).encode('utf8').decode()

This runs when the first non-hex value is encountered, at which point you "flush the cache" of hexes, decode them all together in the codepage, and then re-encode to UTF-8.

@joshy created #40. All tests pass:
test session starts
platform darwin -- Python 3.11.2, pytest-7.3.0, pluggy-1.0.0
collected 29 items
tests/test_ansicpg1250.py . [ 3%]
tests/test_bytes.py . [ 6%]
tests/test_calcium_score.py . [ 10%]
tests/test_encoding.py . [ 13%]
tests/test_french.py . [ 17%]
tests/test_from_strings.py .. [ 24%]
tests/test_hello.py . [ 27%]
tests/test_hyperlinks.py ... [ 37%]
tests/test_issue_11.py . [ 41%]
tests/test_issue_15.py . [ 44%]
tests/test_issue_20.py . [ 48%]
tests/test_issue_28.py . [ 51%]
tests/test_issue_29.py .. [ 58%]
tests/test_issue_38.py .. [ 65%]
tests/test_line_breaks_google_docs.py . [ 68%]
tests/test_line_breaks_textedit_mac.py . [ 72%]
tests/test_nested_table.py . [ 75%]
tests/test_nutridoc.py . [ 79%]
tests/test_sample_3.py . [ 82%]
tests/test_simple.py .. [ 89%]
tests/test_speiseplan.py . [ 93%]
tests/test_tx_rtf32.py . [ 96%]
tests/test_unicode.py . [100%]
29 passed in 0.50s
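For reference, the flush-the-cache idea can be sketched as a standalone helper. This is a hypothetical illustration; the token shape and function name are invented here and are not the actual striprtf code from #40:

```python
def decode_hex_runs(tokens, encoding):
    """Accumulate consecutive \\'xx hex bytes and decode them together,
    so multi-byte codepages such as cp936 work.  `tokens` is a simplified
    stream of ("hex", "xx") or ("text", s) items (hypothetical shape)."""
    out = []
    buf = bytearray()

    def flush():
        if buf:
            out.append(buf.decode(encoding))  # decode the whole run at once
            buf.clear()

    for kind, value in tokens:
        if kind == "hex":
            buf += bytes.fromhex(value)       # keep caching hex bytes
        else:
            flush()                           # first non-hex token flushes the cache
            out.append(value)
    flush()
    return "".join(out)

# "中文" in cp936 is b'\xd6\xd0\xce\xc4', i.e. four \'xx directives:
tokens = [("hex", "d6"), ("hex", "d0"), ("hex", "ce"), ("hex", "c4"), ("text", "!")]
print(decode_hex_runs(tokens, "cp936"))  # 中文!
```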
Thank you @stevengj for the tip about handling the multiple hex values. Trying to work on this after working all day, I couldn't wrap my head around it, which is why I stopped and switched to pandoc for my use case. Hopefully this will help @nikita-petrashen as well.
Hey!
I was trying to decode an RTF file and rtf_to_text threw an error "Cannot decode a symbol \u0417 blah-blah". I tried passing several different encodings to rtf_to_text, but it didn't work. Then I looked into the source code and found this line. I uncommented it and tried passing "blahblah" as the encoding, expecting it to throw me a print, but it didn't. Then I started to suspect that this or this line actually overrides the encoding argument that I pass to the function. After removing these two lines and setting the correct encoding, everything went smoothly. Idk what's going on there, but you should probably look into it, as this feels wrong :).
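For context, \u0417 is the Cyrillic letter "З". A minimal sketch in plain Python (unrelated to striprtf internals) of why the effective encoding matters for this character:

```python
# \u0417 is Cyrillic "З".  It round-trips fine through cp1251 but has no
# slot in cp1252, so an ignored or overridden encoding argument can
# easily surface as an error on this character:
ch = "\u0417"
print(ch.encode("cp1251").decode("cp1251"))  # З
try:
    ch.encode("cp1252")
except UnicodeEncodeError as e:
    print("cp1252 cannot represent it:", e)
```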