pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Crash on non-ASCII input. #1032

Open vk2diy opened 1 month ago

vk2diy commented 1 month ago

Description

Crash on non-ASCII input: UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)

Steps to reproduce the bug

To make reproduction easier, the command below downloads mc3362.pdf and then runs pdf2txt.py on it.

  1. wget https://github.com/user-attachments/files/16489263/mc3362.pdf && pdf2txt.py mc3362.pdf

Error produced

Traceback (most recent call last):
  File "pdf2txt.py", line 115, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
                                        ^^^^^^^^^^^^^^
  File "pdf2txt.py", line 110, in main
    interpreter.process_page(page)
  File "/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 841, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 854, in render_contents
    self.execute(list_value(streams))
  File "/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 869, in execute
    name = keyword_name(obj).decode('ascii')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
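
The failing decode can be reproduced in isolation, without the PDF. This is only a minimal sketch: it assumes nothing beyond what the error message states, namely that the keyword starts with byte 0x85. The remaining bytes of the keyword are unknown, and `try_decode_ascii` is a hypothetical helper, not part of pdfminer.

```python
# Byte 0x85 lies outside the 7-bit ASCII range (0x00-0x7F), so a strict
# ASCII decode fails the same way as the last frame of the traceback.
raw = b"\x85"  # assumption: only the first byte is known from the error message


def try_decode_ascii(data: bytes) -> str:
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)


message = try_decode_ascii(raw)
print(message)
```

The printed message should match the final line of the traceback, which suggests the crash comes from the strict `decode('ascii')` call rather than from anything specific to pdf2txt.py.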

dhdaines commented 1 month ago

What version of pdfminer.six are you using? I can't reproduce this with either Python 3.11 or 3.12 and pdfminer.six v20240706.

vk2diy commented 1 month ago

Looks old.

./lib/python3.12/site-packages/pdfminer-20191125.dist-info

I'm not sure why it would be old; I installed it with pip. I'm not really a Python person.
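
One way to check what pip actually installed is to query the package metadata from Python. The sketch below uses the standard-library `importlib.metadata`. It assumes, based only on the directory name above, that the installed distribution may be called `pdfminer` rather than `pdfminer.six`; the thread does not confirm this. `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Both names are checked because the dist-info directory above reads
# "pdfminer-20191125", not "pdfminer.six-..." (an inference, not a confirmed cause).
for name in ("pdfminer", "pdfminer.six"):
    print(name, "->", installed_version(name))
```

If only `pdfminer` reports a version, that would be consistent with the old date in the dist-info directory name.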