smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Duplicate images in PDF > Find which pages it occurs on #718

Open eddih19 opened 4 weeks ago

eddih19 commented 4 weeks ago

Hi!

I've got a document that contains several duplicate images. getObjectsByType('XObject', 'Image') returns the images but seems to "skip" or combine the duplicates. Usually this wouldn't be a problem, but I am relying on the specific order of the images.

Is one of the following solutions possible?

  1. Loop through each page and get the XObjects from that specific page (getObjectsByType doesn't work inside the getPages() loop)

  2. Check the getDetails() (or something similar) of the object to see which pages it occurs on?

  3. Disable the skip/combine behaviour and obtain all images, duplicates included
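
To make option 1 concrete, here is a rough sketch of what a per-page loop might look like. It assumes the installed version's Page class exposes getXObjects() and that this method can list the same object under more than one key, so treat both as assumptions to check against the library source. The helper name imagesByPage and the $isImage callback are made up for illustration. Also note that this follows the order of each page's resource list, which is not necessarily the order in which the images are drawn.

```php
<?php
// Sketch only, not the library's documented approach.
// Assumption: each page object has a getXObjects() method returning an
// array of XObjects, possibly listing the same object under several keys.

/**
 * Map each 1-based page number to the image XObjects found in that
 * page's resources. $isImage decides which XObjects count as images.
 *
 * @return array<int, object[]>
 */
function imagesByPage(iterable $pages, callable $isImage): array
{
    $result = [];
    $pageNumber = 0;

    foreach ($pages as $page) {
        $pageNumber++;
        $seen = []; // object ids already recorded for this page

        foreach ($page->getXObjects() as $xobject) {
            $id = spl_object_id($xobject);
            if (isset($seen[$id]) || !$isImage($xobject)) {
                continue; // same object under another key, or not an image
            }
            $seen[$id] = true;
            $result[$pageNumber][] = $xobject;
        }
    }

    return $result;
}
```

With the library loaded, this could be called as imagesByPage($pdf->getPages(), fn ($o) => $o instanceof \Smalot\PdfParser\XObject\Image), where $pdf is the document returned by the parser. Because duplicates are only collapsed within a single page, an image reused on several pages would show up once under each page number.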

Many thanks in advance! :)