py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Image/Page Matching Issue when using PDF file converted by LibreOffice #2822

Closed ChristophGmeiner closed 3 weeks ago

ChristophGmeiner commented 3 weeks ago

See all details and code on the LibreOffice site: https://ask.libreoffice.org/t/pdf-encoding-issues-with-page-image-matching-when-converting-docx-to-pdf/110150/3

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-14.5-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.4.0

pubpub-zz commented 3 weeks ago

Please provide the generated PDF.

ChristophGmeiner commented 3 weeks ago

test04_libreoffice_exportwriter.pdf

pubpub-zz commented 3 weeks ago

There is no bug in pypdf: LibreOffice attaches all the images to the resources of every page.

Only the content stream of each page determines which of those images are actually drawn. If you want only the images that are used, your code should look like this:

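for p in pdf_reader.pages:
    # Resource keys (without the leading "/") of the XObjects drawn by the "Do" operator
    used = {operands[0][1:] for operands, operator in p.get_contents().operations if operator == b"Do"}
    print(p.page_number)
    # img.name is the key plus a file extension, so compare only the part before the first "."
    print([img.name for img in p.images if img.name.split(".")[0] in used])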

img.name contains the resource key, but without the leading "/" and with a file extension appended, so the comparison is a little tricky.
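The name matching described above can be separated from pypdf and sketched on plain data. This is only a minimal illustration: the helper names `used_xobject_keys` and `filter_used_images` are hypothetical and not part of pypdf, and the sketch assumes the operations arrive as `(operands, operator)` pairs with a bytes operator, as in the snippet above.

```python
import os


def used_xobject_keys(operations):
    """Collect the resource keys drawn via the "Do" operator.

    `operations` is assumed to be an iterable of (operands, operator)
    pairs, with the operator given as bytes (for example b"Do").
    """
    keys = set()
    for operands, operator in operations:
        if operator == b"Do" and operands:
            name = str(operands[0])
            # Resource keys are written as "/Im1" in the content stream.
            if name.startswith("/"):
                name = name[1:]
            keys.add(name)
    return keys


def filter_used_images(image_names, operations):
    """Keep only the image names whose key is referenced by "Do".

    Image names are assumed to be "<key><extension>", e.g. "Im1.png".
    os.path.splitext drops only the final extension, which differs from
    split(".")[0] when the key itself contains a dot.
    """
    used = used_xobject_keys(operations)
    return [n for n in image_names if os.path.splitext(n)[0] in used]


# Hypothetical data: the page resources list two images, but only one is drawn.
example_operations = [([], b"q"), (["/Im1"], b"Do"), ([], b"Q")]
print(filter_used_images(["Im1.png", "Im2.jpg"], example_operations))  # ['Im1.png']
```

With pypdf, the inputs would presumably be `[img.name for img in p.images]` and `p.get_contents().operations` for each page `p`. Note that `get_contents()` can return `None` for a page without content, so a guard may be needed there.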

mikekaganski commented 3 weeks ago

@pubpub-zz Is the code you provided a universal way to find the images used on specific pages, or does it only apply to PDFs generated by LibreOffice? Is the behavior of LibreOffice a bug, or is it a valid variant?

stefan6419846 commented 3 weeks ago

The provided code should be mostly universal, as it basically checks which image objects actually are referenced in the page itself on plain PDF operator level.

I am not sure whether the standard says anything about this, so you might want to have a look at it yourself, but from my experience the behavior of LibreOffice in this case is something I usually do not see from other PDF generators.

pubpub-zz commented 3 weeks ago

The standard provides "grammar", not "stories". This approach is compliant and I've seen it many times. Also, if something is to be "blamed", it is LibreOffice, not pypdf. I am converting this to a discussion for future reference.