joshy / striprtf

Stripping rtf to plain old text
http://striprtf.dev
BSD 3-Clause "New" or "Revised" License

ValueError: chr() arg not in range(0x110000) #2

Closed pierremonico closed 5 years ago

pierremonico commented 5 years ago

Hi,

It works well most of the time, but at some point I got the following error:

  File "~/.env/lib/python3.7/site-packages/striprtf/striprtf.py", line 112, in rtf_to_text
    if c > 127: out.append(chr(c)) #NOQA
ValueError: chr() arg not in range(0x110000)

I debugged a little, and the problematic argument is 2018444. I can catch the error, so it's no big deal, but it would be great if you could fix it. I haven't had time to dig into your implementation and I don't know anything about RTF structure, but if needed I can work on it and send you a PR when I find some time.
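For reference, `chr()` only accepts code points up to 0x10FFFF (1114111), so 2018444 is out of range. A minimal sketch of a guard, assuming a hypothetical helper named `safe_chr` (not part of striprtf) that substitutes U+FFFD for invalid values:

```python
import sys


def safe_chr(codepoint, fallback="\ufffd"):
    """Return chr(codepoint), or `fallback` if the value is not a valid code point.

    Hypothetical helper for illustration only; striprtf does not ship it.
    """
    # sys.maxunicode is 0x10FFFF (1114111), the largest value chr() accepts.
    if 0 <= codepoint <= sys.maxunicode:
        return chr(codepoint)
    return fallback


print(safe_chr(0x2018))          # a valid code point: the left single quote
print(repr(safe_chr(2018444)))   # out of range, so the fallback is returned
```

Wrapping the call to `rtf_to_text` in `try/except ValueError` also avoids the crash, but then the output for the whole document is lost rather than a single character.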

Cheers!

joshy commented 5 years ago

Hi,

I never tested with Python 3.7. Can you share an RTF file that causes the crash?

pierremonico commented 5 years ago

Thanks for your reply. Unfortunately I can't, since it is the intellectual property of a client of mine and must stay undisclosed. I will try other Python versions tonight and see if I can dig into it.

pierremonico commented 5 years ago

Hi,

I figured out where the error comes from: a left quote \u2018 in this exact position, 10‘000, which is apparently still used by some of my fellow French speakers as a thousands separator :)

I see you handle this as a special char. I also tried 10‘00, and the lib handles that without problems. So I guess the problem must be somewhere in your regex (not handling dd‘ddd)!
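To illustrate the guess, here is a sketch of how a value like 2018444 could arise if the digits that follow the escape are absorbed into its numeric argument. The pattern `U_ESCAPE` and the function `parse_u_args` are hypothetical simplifications written for this example, not striprtf's actual regex, and the input fragments are made up:

```python
import re

# Hypothetical, simplified pattern: a \u control word followed by up to
# ten digits. This is an assumption for illustration, NOT striprtf's regex.
U_ESCAPE = re.compile(r"\\u(-?\d{1,10})")


def parse_u_args(rtf_fragment):
    """Return the numeric argument of every \\u escape found in the fragment."""
    return [int(m.group(1)) for m in U_ESCAPE.finditer(rtf_fragment)]


# A space after the escape ends the digit run, so the argument stays small.
print(parse_u_args(r"10\u2018 000"))  # [2018]

# Digits directly after the escape are swallowed into the argument.
print(parse_u_args(r"10\u2018444"))   # [2018444]
```

Under that assumption the second argument exceeds 0x10FFFF, which is exactly the kind of value `chr()` rejects.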

Hope it helps, Cheers.

joshy commented 5 years ago

Hi,

Unfortunately, I wasn't able to reproduce the error. I added a test case (see the files french.rtf, french.txt and test_french.py), which works fine. It also seems to depend on the context in which 10‘000 appears. Since you have already identified where the exception is happening, maybe you can copy only that subsection into french.rtf without leaking any client information, or create a minimal RTF file that causes the crash.

Thanks for your help!

pierremonico commented 5 years ago

Hi and sorry for the late reply,

Weirdly enough, I can't reproduce it with the test cases either (I even tried different configurations). It still fails on my original input, though, so I just catch the error, as it isn't that important :)

Thanks for your help and for making this.

joshy commented 5 years ago

Ok, then I will close it for now, because I can't really do anything.