py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

`PdfReader` causes memory overflow for a particular PDF #2876

Closed by JaMe76 1 month ago

JaMe76 commented 1 month ago

When I run the code below on the attached PDF, memory usage increases steadily. The increase only stops once the process terminates.

I traced the point where the memory consumption starts; it is shown in the screenshot below. In debug mode, I observe consumption continuing to increase even AFTER this step.

[Screenshot from 2024-09-27 12-21-23]

N.B. The PDF seems to be corrupted after p.27.
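
One way to put a number on an increase like this is the standard-library `tracemalloc` module. The following is a minimal sketch, not something taken from this report: `measure_peak` and `build_list` are hypothetical names, and `build_list` is only a stand-in workload so the example runs without a PDF. Note that `tracemalloc` only sees allocations made through Python's allocator.

```python
import tracemalloc
from typing import Any, Callable, Tuple


def measure_peak(func: Callable[[], Any]) -> Tuple[Any, int]:
    """Run func() and return (result, peak traced bytes during the call)."""
    tracemalloc.start()
    try:
        result = func()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if func() raises.
        tracemalloc.stop()
    return result, peak


def build_list(n: int = 100_000) -> int:
    # Hypothetical stand-in workload; replace with the call under test.
    return len([i for i in range(n)])


if __name__ == "__main__":
    result, peak = measure_peak(build_list)
    print(f"result={result}, peak={peak / 1024:.0f} KiB")
```

To apply it to the reproducer, one could pass `lambda: len(PdfReader(path).pages)` as `func`. For a file that never finishes loading, the call will not return, so the peak would have to be sampled from another thread or process instead.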

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.8.0-45-generic-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('cryptography', '43.0.1'), PIL=10.4.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

if __name__ == "__main__":

    pdf_files = "path_to/7f40cb209fb97d1782bffcefc5e7be40.pdf"
    reader = PdfReader(pdf_files)
    print(len(reader.pages))

7f40cb209fb97d1782bffcefc5e7be40.pdf

Traceback

Ignoring wrong pointing object 20 0 (offset 175628)
Ignoring wrong pointing object 21 0 (offset 178628)
Ignoring wrong pointing object 22 0 (offset 179024)
Ignoring wrong pointing object 23 0 (offset 181968)
Ignoring wrong pointing object 24 0 (offset 183153)
Ignoring wrong pointing object 40 0 (offset 324274)
Ignoring wrong pointing object 83 0 (offset 716564)
Ignoring wrong pointing object 84 0 (offset 716986)
Ignoring wrong pointing object 85 0 (offset 719185)
Ignoring wrong pointing object 121 0 (offset 913007)
Ignoring wrong pointing object 160 0 (offset 1098502)
Ignoring wrong pointing object 177 0 (offset 1144830)
Ignoring wrong pointing object 178 0 (offset 1145264)
Ignoring wrong pointing object 202 0 (offset 1240467)
Ignoring wrong pointing object 203 0 (offset 1240848)
Ignoring wrong pointing object 204 0 (offset 1243016)

pubpub-zz commented 1 month ago

Can you please update to the latest version and confirm the results?

JaMe76 commented 1 month ago

Yes, I get the same results after upgrading to pypdf 5.0.0.

pubpub-zz commented 1 month ago

When getting the number of pages, pypdf needs to load/expand objects (the pages and the objects below them) into memory. Why do you consider this a memory overflow?

pubpub-zz commented 1 month ago

(For issue creation/tracking) Note: with the latest dev build, an infinite loop occurs when getting the number of pages: https://github.com/user-attachments/files/17162634/7f40cb209fb97d1782bffcefc5e7be40.pdf
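
To illustrate why counting pages means walking objects, and how such a walk can fail to terminate on a damaged file, here is a toy model. It is not pypdf's code: plain dicts stand in for PDF objects, `count_pages` is a hypothetical helper, and a self-referencing page tree is only an assumed failure mode. Whether that is what actually happens with this PDF is not established in this thread.

```python
from typing import Any, Dict, List, Set

# Toy object table: object id -> node. It mimics the shape of a PDF page
# tree (/Pages nodes with /Kids, /Page leaves); it is NOT pypdf's data model.
Node = Dict[str, Any]


def count_pages(objects: Dict[int, Node], root_id: int) -> int:
    """Count /Page leaves reachable from root_id, visiting each object once."""
    seen: Set[int] = set()
    stack: List[int] = [root_id]
    pages = 0
    while stack:
        obj_id = stack.pop()
        if obj_id in seen or obj_id not in objects:
            # Skip repeated references (cycles) and dangling references.
            continue
        seen.add(obj_id)
        node = objects[obj_id]
        if node.get("Type") == "Page":
            pages += 1
        else:
            stack.extend(node.get("Kids", []))
    return pages


healthy = {
    1: {"Type": "Pages", "Kids": [2, 3]},
    2: {"Type": "Page"},
    3: {"Type": "Pages", "Kids": [4]},
    4: {"Type": "Page"},
}
# Damaged variant: node 3 lists its own ancestor (1) as a kid, and object 9
# does not exist. Without the `seen` set, the walk would never finish.
damaged = {**healthy, 3: {"Type": "Pages", "Kids": [4, 1, 9]}}

print(count_pages(healthy, 1))
print(count_pages(damaged, 1))
```

Both calls count two pages. The `seen` set is what keeps the second walk finite.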

JaMe76 commented 1 month ago

> When getting the number of pages, pypdf needs to load/expand objects (the pages and the objects below them) into memory. Why do you consider this a memory overflow?

Thanks for now.

This was the termination message of the process after trying to open the file for about three hours.

As this might be related to #2877, I will check whether the latest fix also resolves this issue.
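
When a file may keep the parser busy for hours, one defensive option is to do the page count in a child interpreter with a time limit. This is a minimal sketch under stated assumptions, not part of pypdf: `run_with_timeout` and `COUNT_PAGES_CODE` are hypothetical names, and the child program assumes pypdf is installed in the same environment.

```python
import subprocess
import sys
from typing import List, Optional

# Program for the child interpreter: print the page count of the PDF whose
# path is passed as the first argument. Assumes pypdf is importable there.
COUNT_PAGES_CODE = (
    "import sys\n"
    "from pypdf import PdfReader\n"
    "print(len(PdfReader(sys.argv[1]).pages))\n"
)


def run_with_timeout(code: str, args: List[str], timeout_s: float) -> Optional[str]:
    """Run `code` in a fresh interpreter.

    Returns the stripped stdout, or None if the time limit was exceeded.
    Raises subprocess.CalledProcessError if the child exits with an error.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code, *args],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising on timeout.
        return None
    return proc.stdout.strip()
```

For the file in this report, a call could look like `run_with_timeout(COUNT_PAGES_CODE, ["path_to/7f40cb209fb97d1782bffcefc5e7be40.pdf"], timeout_s=60)`. A `None` result then signals that parsing did not finish in time, without tying up the calling process.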

pubpub-zz commented 1 month ago

> This was the termination message of the process after trying to open the file for about three hours.

This sounds like the same issue, and my fix will work. Have a look here to test the latest dev version: https://pypdf.readthedocs.io/en/stable/dev/testing.html#evaluate-a-pr-in-progress-version

JaMe76 commented 1 month ago

The fix works. Thanks again.