Closed ChristophGmeiner closed 3 weeks ago
Please provide the generated PDF.
There is no bug in pypdf: LibreOffice attaches all the images to all the pages.
It is only within the page contents that a subset of the images is actually called. If you want only the images that are used, your code should look like this:
for p in pdf_reader.pages:
    # Keys of the XObjects actually drawn by the page's content stream ("Do" operator)
    used = [op[0][0][1:] for op in p.get_contents().operations if op[1] == b"Do"]
    print(p.page_number)
    print([img.name for img in p.images if img.name.split('.')[0] in used])
img.name contains the resource key without the leading "/" but with a file extension appended, so the code is a little tricky.
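The name matching described above can be isolated into a small sketch that runs without pypdf. The helper names `used_xobject_keys` and `filter_used_images` and the sample data are hypothetical; the `(operands, operator)` tuples only mimic the shape of the entries in `p.get_contents().operations` used in the snippet above. Note that it only inspects the operations it is given, so images drawn from nested content would not be detected.

```python
def used_xobject_keys(operations):
    """Return the resource keys (without the leading "/") invoked via "Do"."""
    return {
        str(operands[0])[1:]          # "/Im1" -> "Im1"
        for operands, operator in operations
        if operator == b"Do" and operands
    }


def filter_used_images(image_names, operations):
    """Keep only the image names whose key is invoked in the operations."""
    used = used_xobject_keys(operations)
    # rsplit drops only the extension, e.g. "Im1.png" -> "Im1"
    return [name for name in image_names if name.rsplit(".", 1)[0] in used]


# Hypothetical data: three images attached to the page, only two are drawn.
operations = [
    ([], b"q"),
    (["/Im1"], b"Do"),
    ([], b"Q"),
    (["/Im3"], b"Do"),
]
image_names = ["Im1.png", "Im2.jpg", "Im3.png"]

print(filter_used_images(image_names, operations))
```

With pypdf, the inputs would presumably be `[img.name for img in p.images]` and `p.get_contents().operations`; a page without a content stream may need a separate check before accessing `.operations`.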
@pubpub-zz, is the code you provided universal for finding the images used on specific pages, or does it only apply to PDFs generated by LibreOffice? Is LibreOffice's behavior a bug, or is it a valid variant?
The provided code should be mostly universal, as it basically checks which image objects are actually referenced in the page itself at the plain PDF operator level.
I am not sure whether the standard says anything about this, so you might want to have a look at it yourself, but from my experience the behavior of LibreOffice in this case is something I usually do not see from other PDF generators.
The standard provides "grammar", not "stories". This approach is compliant, and I've seen it many times. Also, if something is to be "blamed", it is LibreOffice, not pypdf. I am converting this to a discussion for future reference.
See all details and code here on the LibreOffice site: https://ask.libreoffice.org/t/pdf-encoding-issues-with-page-image-matching-when-converting-docx-to-pdf/110150/3
Environment
Which environment were you using when you encountered the problem?