py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

`TypeError` in `_cmap.py` when calling `extract_text()` #2750

Open NikolaiLyssogor opened 3 weeks ago

NikolaiLyssogor commented 3 weeks ago

I'm trying to extract text from each page of a large number of PDFs. A few of them are giving me the issue shown in the traceback. This seems to be related to #2286.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-14.5-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.0.1, crypt_provider=('cryptography', '42.0.7'), PIL=10.4.0

Code + PDF

This is a minimal, complete example that shows the issue:

import pypdf
filepath = "path/to/file.pdf"
reader = pypdf.PdfReader(filepath)
pages = [reader.pages[i] for i in range(len(reader.pages))]
page_text = [pg.extract_text() for pg in pages]

The PDF that is causing this issue can't be shared because it contains sensitive information. However, here is the result of reader.metadata:

{'/Producer': 'pypdf'}

I'm not the one creating the PDFs and unfortunately I haven't been able to reproduce the issue so that I can share it here.

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/pypdf/_cmap.py", line 445, in compute_space_width
    raise Exception("Not in range")
Exception: Not in range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/processors/base_processor.py", line 662, in extract_text
    page_text.append(page.extract_text())
                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pypdf/_page.py", line 2076, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pypdf/_page.py", line 1588, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pypdf/_cmap.py", line 93, in build_char_map_from_dict
    sp_width = compute_space_width(ft, sp, space_width)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pypdf/_cmap.py", line 459, in compute_space_width
    if x > 0:
       ^^^^^
TypeError: '>' not supported between instances of 'IndirectObject' and 'int'

stefan6419846 commented 3 weeks ago

Apparently this is another case where we are dealing with an object reference instead of a direct value. In theory, using x.get_object() > 0 should work here.

NikolaiLyssogor commented 3 weeks ago

Thanks for the quick response. Adding

x = x.get_object() if isinstance(x, IndirectObject) else x

right before the line where the error is occurring solved the issue for me.
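
To illustrate the pattern outside of pypdf, here is a minimal, self-contained sketch. It is not the library's code: FakeIndirectObject, resolve, and count_positive_widths are hypothetical names standing in for pypdf's IndirectObject and the loop in compute_space_width, so that the example runs without a PDF file.

```python
# Sketch only: FakeIndirectObject is a hypothetical stand-in for pypdf's
# IndirectObject, so this example runs without pypdf or a PDF file.
class FakeIndirectObject:
    """Holds a reference to a value instead of the value itself."""

    def __init__(self, target):
        self._target = target

    def get_object(self):
        # Return the value that the reference points to.
        return self._target


def resolve(x):
    """Return the referenced value if x is an indirect reference, else x."""
    return x.get_object() if isinstance(x, FakeIndirectObject) else x


def count_positive_widths(widths):
    """Hypothetical loop: count the widths greater than zero."""
    count = 0
    for x in widths:
        # Without this line, comparing a FakeIndirectObject with an int
        # raises TypeError, as in the traceback above.
        x = resolve(x)
        if x > 0:
            count += 1
    return count


# A mix of direct values and references, as might appear in a widths array.
widths = [500, FakeIndirectObject(250), 0, FakeIndirectObject(0)]
print(count_positive_widths(widths))
```

The isinstance check leaves direct values untouched, so the same line handles both kinds of entry.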

pubpub-zz commented 3 weeks ago

@NikolaiLyssogor you seem to be on an old version. Please upgrade to the latest version and retest.

NikolaiLyssogor commented 3 weeks ago

Tested again with 4.2.0. The original issue still occurs. The fix proposed above also solves the issue in 4.2.0, at least for the documents I have been testing on.

pubpub-zz commented 3 weeks ago

Can you confirm that just adding x = x.get_object() works? If so, can you propose a PR against the main branch?

NikolaiLyssogor commented 3 weeks ago

It's working on my documents. There was also no change to which tests are passing in the test suite. I'll open a PR.