py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Code sample in documentation causes exception #2050

Closed Dotrar closed 1 year ago

Dotrar commented 1 year ago

This page suggests doing the following:

reader = PdfReader("example.pdf")
...

for page in reader.pages:
    page.compress_content_streams()  # This is CPU intensive!
    writer.add_page(page)

However, this can't work: there is a specific exception ensuring that the page being compressed belongs to a PdfWriter, and the code sample uses a page from the PdfReader.


We found this because we recently updated from ~2.8.X to 3.0.X, and our 600 KB files from the deprecated PdfMerger are now 10 MB (!) files with PdfWriter. We desperately need to reduce file sizes, so any more tips about that would be welcome :)

stefan6419846 commented 1 year ago

You should use the correct RTD page, which seems to use the correct code: https://pypdf.readthedocs.io/en/latest/user/file-size.html#lossless-compression
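
For reference, a minimal sketch of the ordering that avoids the exception described above: add the pages to the writer first, then compress the copies the writer owns. The helper name `add_then_compress` is hypothetical and only there for illustration; it relies on nothing beyond `reader.pages`, `writer.add_page()`, `writer.pages` and `page.compress_content_streams()`.

```python
def add_then_compress(reader, writer):
    """Copy every page of `reader` into `writer`, then compress the writer's pages.

    `add_then_compress` is a hypothetical helper name, not part of pypdf.
    """
    # Step 1: copy the pages into the writer.
    for page in reader.pages:
        writer.add_page(page)

    # Step 2: compress the pages owned by the writer, not the reader's originals.
    for page in writer.pages:
        page.compress_content_streams()  # This is CPU intensive!

    return writer


# Usage (assumes pypdf is installed and example.pdf exists):
#   from pypdf import PdfReader, PdfWriter
#   writer = add_then_compress(PdfReader("example.pdf"), PdfWriter())
#   writer.write("out.pdf")
```

The only difference from the failing sample is that `compress_content_streams()` is called on pages taken from `writer.pages` rather than from `reader.pages`.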

Dotrar commented 1 year ago

Thanks for the clarification. Quite a mix-up between the two projects.

MartinThoma commented 1 year ago

@Dotrar I agree that the Read the Docs (RTD) generated docs of PyPDF2 are confusing, as they are still linked in a lot of places and look very similar to those of pypdf. I've opened https://github.com/py-pdf/pypdf/issues/2051 to fix that.

Dotrar commented 1 year ago

@MartinThoma thanks for that.

Regarding our issue at hand, we've found a massive increase in file sizes when joining multiple PDFs together after we updated from the old PyPDF2 PdfMerger to the newer pypdf PdfWriter, using the same example code from the docs.

Are there documentation changes or something obvious we're missing, or should I just open a new ticket?

ATM, the solution for us is to revert and use the older PyPDF2 ( ._.)

MartinThoma commented 1 year ago

Can you tell me more about which merge / append methods you're using?

Do you add watermarks/stamps?

pubpub-zz commented 1 year ago

@Dotrar You should use append() and not add_page(). Some issues about this have been reported before, so you should also search the closed issues.
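
A minimal sketch of merging with `append()` instead of page-by-page `add_page()`. The helper name `merge_with_append` and the file names in the usage comment are hypothetical. Whether this alone brings the output back to the old PdfMerger sizes is not confirmed in this thread.

```python
def merge_with_append(writer, sources):
    """Append each source document to `writer` as a whole, in order.

    `merge_with_append` is a hypothetical helper name, not part of pypdf.
    `sources` may hold anything `writer.append()` accepts, such as file paths.
    """
    for source in sources:
        # append() takes the whole document, rather than one page at a time.
        writer.append(source)
    return writer


# Usage (assumes pypdf is installed; the file names are placeholders):
#   from pypdf import PdfWriter
#   writer = merge_with_append(PdfWriter(), ["a.pdf", "b.pdf"])
#   writer.write("merged.pdf")
#   writer.close()
```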