py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

binascii.Error: Odd-length string when parsing pdf #2216

Open vors opened 11 months ago

vors commented 11 months ago

I'm trying to extract text from a single PDF page, but parsing crashes with binascii.Error: Odd-length string.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Darwin-22.6.0-x86_64-i386-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.16.2, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

https://github.com/vors/pypdf-text-parsing-repro (has pdf)

from pypdf import PdfReader

reader = PdfReader("input.pdf")
page = reader.pages[0]
page.extract_text()

Share here the PDF file(s) that cause the issue. The smaller they are, the better. Let us know if we may add them to our tests!

You can use them in your tests.

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "repro.py", line 5, in <module>
    page.extract_text()
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_page.py", line 2266, in extract_text
    visitor_text,
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_page.py", line 1901, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_cmap.py", line 30, in build_char_map
    space_width, ft
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_cmap.py", line 54, in build_char_map_from_dict
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_cmap.py", line 240, in parse_to_unicode
    int_entry,
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_cmap.py", line 310, in process_cm_line
    multiline_rg = parse_bfrange(line, map_dict, int_entry, multiline_rg)
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_cmap.py", line 369, in parse_bfrange
    ] = unhexlify(fmt2 % c).decode("utf-16-be", "surrogatepass")
binascii.Error: Odd-length string
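
For context, the last frame fails inside binascii.unhexlify, which only accepts input with an even number of hex digits. The sketch below reproduces that error in isolation. Note that decode_hex_utf16 is a hypothetical helper written for illustration, not a pypdf function, and the actual value of fmt2 % c produced for this PDF is not visible in the traceback.

```python
from binascii import Error, unhexlify


def decode_hex_utf16(hex_digits: bytes) -> str:
    """Decode hex digits to text the same way the failing line does.

    Hypothetical helper: it mirrors the unhexlify(...).decode(...) call
    from the traceback, but is not part of pypdf.
    """
    return unhexlify(hex_digits).decode("utf-16-be", "surrogatepass")


# Four hex digits form one UTF-16-BE code unit, so this decodes cleanly.
print(decode_hex_utf16(b"0041"))

# Three hex digits cannot be split into whole bytes, so unhexlify raises.
try:
    decode_hex_utf16(b"041")
except Error as exc:
    print(f"binascii.Error: {exc}")
```

If the CMap range in the PDF yields a hex string with an odd number of digits, this is presumably the path that raises; confirming that would require inspecting the font's ToUnicode stream.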

MatteoRiva95 commented 6 months ago

@vors Did you succeed in solving the issue? If yes, how did you do it?

vors commented 6 months ago

Oh, I didn't dig too deep, but I forked pypdf and wrapped this line in a try/except that ignores the exception. That seems to at least get me past the problem :)
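
For anyone who would rather not maintain a fork, a similar idea can be sketched at the call site instead. This is not the patch described above: it catches binascii.Error around the whole extract_text() call, so a failing page yields no text at all, whereas wrapping the single line inside the library would presumably only skip the bad character mapping. The names extract_text_or_none and extract_all are hypothetical.

```python
import binascii
from typing import List, Optional


def extract_text_or_none(page) -> Optional[str]:
    """Return page.extract_text(), or None if hex decoding fails during parsing.

    Hypothetical helper: `page` is assumed to be a pypdf page object
    (anything with an extract_text() method works).
    """
    try:
        return page.extract_text()
    except binascii.Error:
        # The page's font mapping could not be parsed; skip this page.
        return None


def extract_all(path: str) -> List[Optional[str]]:
    """Extract text from every page, with None for pages that fail."""
    # Imported here so the helper above can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [extract_text_or_none(page) for page in reader.pages]
```

With the repro file, extract_all("input.pdf") would be expected to return a list whose first entry is None rather than raising, assuming the error is the same one shown in the traceback.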