py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Add viewport (bbox) of image rendered in the page to the FileImage class #2763

Closed FSeidinger closed 2 months ago

FSeidinger commented 2 months ago

Discussed in https://github.com/py-pdf/pypdf/discussions/2762

Originally posted by **FSeidinger** July 20, 2024

# Use case

We get a lot of PDFs uploaded by customers that are scanned documents or forms, so most of the time a PDF page contains only a single image. The customers mainly use smartphones or scanners to produce the uploaded PDFs. Many of these phones and scanners embed images at the full resolution of the camera, which produces huge PDFs. It is not uncommon to see images with a native resolution of 1,200 DPI or even higher. Before sending the images to an archive, I want to resize/resample them to a target resolution of 72 DPI.

# Current situation

While pypdf gives me the images on the page and their physical size, it does not give me the viewport of the rendered image in user coordinates. I would need this for the resampling step.

# Expected situation

The `FileImage` or the `PageObject` class should be extended to also contain the bounding box (viewport) of the rendered image in user coordinates.

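To make the resampling arithmetic in the use case concrete, here is a minimal sketch of how a viewport would be used if it were available. It assumes the default PDF user space unit of 1/72 inch (no `UserUnit` override); the function names `effective_dpi` and `resampled_size` and the `(x0, y0, x1, y1)` bbox layout are hypothetical and not part of pypdf.

```python
POINTS_PER_INCH = 72.0  # assumption: default user space unit (1/72 inch)


def effective_dpi(pixel_width, pixel_height, bbox):
    """Return (horizontal, vertical) DPI of an image drawn into bbox.

    bbox is (x0, y0, x1, y1) in user space units.
    """
    x0, y0, x1, y1 = bbox
    width_in = abs(x1 - x0) / POINTS_PER_INCH
    height_in = abs(y1 - y0) / POINTS_PER_INCH
    if width_in == 0 or height_in == 0:
        raise ValueError("bbox has zero width or height")
    return pixel_width / width_in, pixel_height / height_in


def resampled_size(pixel_width, pixel_height, bbox, target_dpi=72.0):
    """Return the pixel size needed to show the image in bbox at target_dpi.

    The result is capped at the original size so images are never upsampled.
    """
    x0, y0, x1, y1 = bbox
    new_w = max(1, round(abs(x1 - x0) / POINTS_PER_INCH * target_dpi))
    new_h = max(1, round(abs(y1 - y0) / POINTS_PER_INCH * target_dpi))
    return min(pixel_width, new_w), min(pixel_height, new_h)


# Example: a 9920 x 14033 pixel scan shown on a full A4 page (595 x 842 pt).
a4 = (0, 0, 595, 842)
print(effective_dpi(9920, 14033, a4))
print(resampled_size(9920, 14033, a4))
```

With a bbox covering the whole A4 page, the scan in the example works out to roughly 1,200 DPI, and the 72 DPI target size equals the page size in points.
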
FSeidinger commented 2 months ago

FYI. PyMuPDF has a solution for that using Page.get_image_bbox.

See Page.get_image_bbox for reference.

pubpub-zz commented 2 months ago

This is quite tricky. The image size is defined within the content of the page, not on the object. The same object can be used on multiple pages, and multiple times at different sizes on the same page.

FSeidinger commented 2 months ago

> This is quite tricky. The image size is defined within the content of the page, not on the object. The same object can be used on multiple pages, and multiple times at different sizes on the same page.

Yes, I know. And my ability to parse PDFs is far from sufficient to do this myself.

The way to go is similar to rendering the page: apply the PDF operators and get the bounding box from there. This is how PyMuPDF does it.
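The idea of walking the page content and deriving the bounding box can be sketched as follows. This is only an illustration under simplifying assumptions: the content is given as a hypothetical, already parsed list of `(operands, operator)` pairs, only the `q`, `Q`, `cm` and `Do` operators are handled, and nested form XObjects, inline images, page rotation and `UserUnit` are ignored. It is not how pypdf or PyMuPDF implement this.

```python
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)  # PDF matrix [a b c d e f]


def multiply(m, n):
    """Return the matrix product m x n (m is applied first, then n)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )


def transform(m, x, y):
    """Map the point (x, y) through matrix m."""
    a, b, c, d, e, f = m
    return a * x + c * y + e, b * x + d * y + f


def image_bboxes(operations):
    """Return [(name, (x0, y0, x1, y1)), ...] for every Do operator.

    An image is painted into the unit square, so its bbox is the unit
    square mapped through the current transformation matrix (CTM).
    """
    ctm = IDENTITY
    stack = []  # saved CTMs for q/Q
    found = []
    for operands, operator in operations:
        if operator == "q":
            stack.append(ctm)
        elif operator == "Q":
            if stack:
                ctm = stack.pop()
        elif operator == "cm":
            # cm premultiplies the CTM with the given matrix.
            ctm = multiply(tuple(float(v) for v in operands), ctm)
        elif operator == "Do":
            # Do can also paint form XObjects; a real implementation
            # would check the resource's /Subtype before treating it as an image.
            corners = [transform(ctm, x, y) for x, y in ((0, 0), (1, 0), (0, 1), (1, 1))]
            xs = [p[0] for p in corners]
            ys = [p[1] for p in corners]
            found.append((operands[0], (min(xs), min(ys), max(xs), max(ys))))
    return found


# The same image object drawn twice at different sizes on one page.
ops = [
    ([], "q"), ([595, 0, 0, 842, 0, 0], "cm"), (["/Im0"], "Do"), ([], "Q"),
    ([], "q"), ([100, 0, 0, 50, 20, 30], "cm"), (["/Im0"], "Do"), ([], "Q"),
]
for name, bbox in image_bboxes(ops):
    print(name, bbox)
```

The example input shows why a single bbox per image object is not enough: `/Im0` is painted once across the full page and once as a small thumbnail, so the result has to be a list of placements per page.
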

pubpub-zz commented 2 months ago

Just to get to a quick solution: can't you simply assume that the image is displayed on the full page (very easy to get through the mediabox)? That should be sufficient, no? Remember that you can use the `.replace()` function for the images.
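A rough sketch of this workaround, assuming pypdf and Pillow are installed, that every image fills its whole page, and that the page uses the default 1/72 inch user unit. The helper names `fit_to_page` and `downsample_pdf` and the file paths are hypothetical; inline images may not be replaceable, and this sketch does not handle that case.

```python
def fit_to_page(pixel_size, page_size_pt, target_dpi=72.0):
    """Pixel size for an image assumed to cover the whole page at target_dpi."""
    px_w, px_h = pixel_size
    page_w, page_h = page_size_pt
    new_w = max(1, round(page_w / 72.0 * target_dpi))
    new_h = max(1, round(page_h / 72.0 * target_dpi))
    if new_w >= px_w and new_h >= px_h:
        return (px_w, px_h)  # already small enough, never upsample
    return (new_w, new_h)


def downsample_pdf(src, dst, target_dpi=72.0):
    """Resample every image in src to target_dpi and write the result to dst."""
    # Imported here so fit_to_page stays usable without the third-party packages.
    from pypdf import PdfWriter

    writer = PdfWriter(clone_from=src)
    for page in writer.pages:
        page_size = (float(page.mediabox.width), float(page.mediabox.height))
        for img in page.images:
            pil = img.image  # Pillow image, or None if it could not be decoded
            if pil is None:
                continue
            new_size = fit_to_page(pil.size, page_size, target_dpi)
            if new_size != pil.size:
                img.replace(pil.resize(new_size))
    with open(dst, "wb") as fh:
        writer.write(fh)


# Usage (paths are placeholders):
# downsample_pdf("scan.pdf", "scan_72dpi.pdf")
```

For single-image scans this avoids parsing the content stream entirely. It gives wrong results for pages where an image covers only part of the page, since such images would be sized as if they filled it.
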

pubpub-zz commented 2 months ago

@FSeidinger did my proposal help you?

pubpub-zz commented 2 months ago

Without any feedback, I am closing this issue. Feel free to post an update to reopen it if necessary.