py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

#3 Using PdfReader causes a crash #2836

Closed macdeport closed 1 month ago

macdeport commented 2 months ago

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-13.6.9-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('cryptography', '41.0.7'), PIL=10.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

    from pypdf import PdfReader

    reader = PdfReader(pdf_path)  # pdf_path: path to the affected PDF
    txt = ''
    for page in reader.pages:
        txt += page.extract_text()  # <= Crash

Sorry, I can't share this PDF because it contains private information.

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_cmap.py", line 209, in parse_encoding
    encoding[x] = adobe_glyphs[o]  # type: ignore
    ~~~~~~~~^^^
IndexError: list assignment index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/alain/Documents/Logiciels/Developpement/py-km-pathfinder-selection/pathfinder-selection-ocred-pdf-compress.py", line 1743, in <module>
    txt_in = pdf_text(fn_in) # <=
             ^^^^^^^^^^^^^^^
  File "/Users/alain/Documents/Logiciels/Developpement/py-km-pathfinder-selection/pathfinder-selection-ocred-pdf-compress.py", line 981, in pdf_text
    txt += page.extract_text() # <=
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_page.py", line 2102, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_page.py", line 1612, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_cmap.py", line 57, in build_char_map_from_dict
    encoding, space_code = parse_encoding(ft, space_code)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_cmap.py", line 211, in parse_encoding
    encoding[x] = o  # type: ignore
    ~~~~~~~~^^^
IndexError: list assignment index out of range
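
The traceback shows the same list assignment failing twice: once inside a `try` (line 209) and again inside its fallback (line 211), which is why Python prints "During handling of the above exception, another exception occurred". A minimal sketch of that pattern in plain Python, independent of pypdf; the 256-entry table, the tiny glyph map, and the index 300 are assumptions for illustration, since the actual values in this PDF are unknown:

```python
# Hypothetical stand-ins for illustration; not pypdf's actual data.
encoding = [chr(i) for i in range(256)]  # fixed-size table with 256 slots
glyph_names = {"/space": " "}            # tiny stand-in for a glyph-name map


def assign(table, index, name):
    """Store the glyph for `name` at `index`, falling back to the raw name."""
    try:
        table[index] = glyph_names[name]
    except Exception:
        # The fallback writes to the same index, so an out-of-range index
        # fails here a second time.
        table[index] = name


assign(encoding, 32, "/space")    # in range, known name
assign(encoding, 65, "/unknown")  # in range, unknown name: fallback stores the name

try:
    assign(encoding, 300, "/space")  # index beyond the table
    crashed = False
except IndexError as exc:
    crashed = True
    # The first IndexError is kept as the context of the second one.
    first_error = exc.__context__
```

The fallback only helps when the glyph name is unknown; it cannot help when the index itself is out of range, which matches the two identical `IndexError` messages above.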
pubpub-zz commented 2 months ago

@macdeport Can you make a test PDF with only one page, using page.remove_text()?

macdeport commented 2 months ago

fp='/Users/alain/Documents/Perso/Alain/SDC35rM/sdc35-24-4!4-240905.pdf'

#--------------------------
def pdf_text_test(pdf_path):
    """

    (06/09/24 13:18:36)
    """
    #https://pypdf.readthedocs.io/en/stable/
    #https://pypdf.readthedocs.io/en/stable/user/metadata.html
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    #txt=''
    #for page in reader.pages:
    #   txt += page.extract_text() # <= PB Crash
    print(reader.pages[0])
    (reader.pages[0]).remove_text()

    return() # pdf_text()
#--------------------------

pdf_text_test(fp)
{'/Type': '/Page', '/Parent': IndirectObject(3, 0, 4337925520), '/Contents': IndirectObject(5, 0, 4337925520), '/MediaBox': [0, 0, 595, 841], '/Resources': {'/Font': {'/F00': IndirectObject(6, 0, 4337925520), '/F01': IndirectObject(8, 0, 4337925520), '/F02': IndirectObject(10, 0, 4337925520), '/F03': IndirectObject(12, 0, 4337925520)}, '/ProcSet': IndirectObject(15, 0, 4337925520)}}
Traceback (most recent call last):
  File "/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/com.barebones.bbedit/BBEditRunTemp-untitled text 3.py", line 26, in <module>
    pdf_text_test(fp)
  File "/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/com.barebones.bbedit/BBEditRunTemp-untitled text 3.py", line 21, in pdf_text_test
    (reader.pages[0]).remove_text()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PageObject' object has no attribute 'remove_text'

pubpub-zz commented 2 months ago

Oops: remove_text() applies to the whole PDF, so the code should look like this (off the top of my head):

import pypdf

w = pypdf.PdfWriter()
w.append("original.pdf", pages=[0])  # copy only the first page
w.remove_text()  # strip the text from the pages held by the writer
w.write("test_file.pdf")

Check the file: it should no longer contain any sensitive data.
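
One way to check the stripped file is to try text extraction on every page and record either the text or the exception, so a crash on one page does not hide the others. A minimal sketch; `probe_pages` is a hypothetical helper written for this thread, not part of pypdf, and it only assumes that each page object has an `extract_text()` method:

```python
def probe_pages(pages):
    """Try extract_text() on each page; return (index, text, error) tuples.

    `pages` is any iterable of objects exposing extract_text(), such as
    PdfReader(...).pages. On success `error` is None; on failure `text` is None.
    """
    results = []
    for index, page in enumerate(pages):
        try:
            results.append((index, page.extract_text(), None))
        except Exception as exc:  # broad on purpose: the goal is to record it
            results.append((index, None, exc))
    return results


# Hypothetical usage with pypdf (file name taken from the snippet above):
#   from pypdf import PdfReader
#   for index, text, error in probe_pages(PdfReader("test_file.pdf").pages):
#       print(index, repr(text), repr(error))
```

An empty `text` suggests the page text was removed; a non-None `error` shows that the crash still reproduces on that page.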

macdeport commented 2 months ago

Two pieces of good news:

dumb_extract_text_crash.pdf