pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Size issue of cropped PDF #4046

Closed: Syntamin closed this issue 1 week ago

Syntamin commented 1 week ago

Description of the bug

Hello,

I am currently using PyMuPDF to crop a small portion from a single-page PDF document and generate a new PDF file with only this selected area. However, I noticed that the file size of the new PDF is the same as the original, despite containing only a fraction of the content.

I was expecting the newly generated PDF to be significantly smaller than the original. Could you please advise on how to achieve this size reduction while retaining the PDF format? Any guidance on optimizing the output file size for cropped PDF documents would be greatly appreciated.

Thank you for your assistance.

How to reproduce the bug

    import pymupdf

    doc = pymupdf.open(input_file_path)
    new_doc = pymupdf.open()
    page = doc[0]  # the document has a single page, so its index is 0
    page_width, page_height = page.rect.width, page.rect.height
    rect = pymupdf.Rect(0, 0, page_width, 400)  # keep only the top 400 points
    o_page = new_doc.new_page(-1, page_width, 400)
    o_page.show_pdf_page(o_page.rect, doc, 0, clip=rect)
    new_doc.save('./temp/small.pdf', garbage=4, use_objstms=1, clean=True, deflate=True)
    new_doc.close()
    doc.close()

PyMuPDF version

1.24.13

Operating system

MacOS

Python version

3.11

JorjMcKie commented 1 week ago

show_pdf_page only displays a portion of the source page; the complete source page is still embedded in the target page. So no size reduction can be expected.

If you want to achieve that, you must physically remove everything outside clip from the source page: add the four rectangles surrounding clip as redaction annotations, execute page.apply_redactions(), and only then execute show_pdf_page.
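The redaction approach described above could look roughly like the sketch below. The helper names surrounding_rects and crop_page_to_file are hypothetical, not part of PyMuPDF. The sketch assumes an unrotated page whose clip lies fully inside page.rect, and it passes fill=False so the redactions leave no white fill behind. Text, images or drawings that straddle the clip boundary may be affected by the redactions, and the size actually saved depends on the document.

```python
def surrounding_rects(page_box, clip_box):
    """Return the (x0, y0, x1, y1) strips of page_box lying outside clip_box.

    Hypothetical helper; assumes clip_box lies inside page_box.
    """
    px0, py0, px1, py1 = page_box
    cx0, cy0, cx1, cy1 = clip_box
    candidates = [
        (px0, py0, px1, cy0),  # strip above the clip
        (px0, cy1, px1, py1),  # strip below the clip
        (px0, cy0, cx0, cy1),  # strip left of the clip
        (cx1, cy0, px1, cy1),  # strip right of the clip
    ]
    # Drop empty strips (clip touching a page edge).
    return [r for r in candidates if r[2] > r[0] and r[3] > r[1]]


def crop_page_to_file(src_path, dst_path, clip_box, pno=0):
    """Redact everything outside clip_box, then copy the clip to a new PDF."""
    import pymupdf  # imported here so the helper above works without PyMuPDF

    doc = pymupdf.open(src_path)
    page = doc[pno]
    clip = pymupdf.Rect(clip_box)

    # Mark the area around the clip for removal, then remove it for real.
    for box in surrounding_rects(tuple(page.rect), clip_box):
        page.add_redact_annot(pymupdf.Rect(box), fill=False)
    page.apply_redactions()

    # Only now show the (already reduced) source page in the new document.
    new_doc = pymupdf.open()
    new_page = new_doc.new_page(-1, width=clip.width, height=clip.height)
    new_page.show_pdf_page(new_page.rect, doc, pno, clip=clip)
    new_doc.save(dst_path, garbage=4, deflate=True)
    new_doc.close()
    doc.close()
```

For the case in this issue (keeping the top 400 points of the page), a call such as crop_page_to_file(input_file_path, './temp/small.pdf', (0, 0, page_width, 400)) would redact only the single strip below the clip.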

Syntamin commented 1 week ago

@JorjMcKie Does PyMuPDF provide a method to genuinely delete parts outside a specified rectangle, or does MuPDF offer such a feature?

JorjMcKie commented 1 week ago

> @JorjMcKie Does PyMuPDF provide a method to genuinely delete parts outside a specified rectangle, or does MuPDF offer such a feature?

A good question, even a prophetic one 😎: the MuPDF team is preparing such a function! I am not sure whether it will be part of the next version or a later one.