jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in python.
https://borbpdf.com/

BUG Some images fail to load (?) #146

Open a-ndrewang opened 1 year ago

a-ndrewang commented 1 year ago

Describe the bug On loading and immediately dumping certain PDFs, images are lost. I am unsure whether they fail to load or fail to dump, and I haven't yet figured out what these PDFs have in common. Of note, SumatraPDF cannot render PDFs produced this way (i.e. by loading and dumping at all). The Firefox PDF reader does render them, but the images are missing. I have not investigated whether other readers can render these.

To Reproduce A file that reproduces this: fleur-dining-menu-210220.pdf


from borb.pdf import PDF
from borb.toolkit import ImageExtraction

bad_file = "fleur-dining-menu-210220.pdf"
exportname = 'fleur_export.pdf'
def main():
    l : ImageExtraction = ImageExtraction()

    with open(bad_file, 'rb') as f:
        pdf = PDF.loads(f, [l])

    print(l.extract_images()[0]) # returns a single image, the background. 
    # I wonder if the logo should be printed here?

    with open(exportname, 'wb') as f:
        PDF.dumps(f, pdf) # the logo 'fleur' is lost

if __name__ == "__main__":
    main()

Expected behaviour The same PDF should be reproduced after loading it and dumping it.

Screenshots Left - original; Right - after loading and dumping using borb. SumatraPDF would not render the PDF on the right, so Firefox was used.

Screenshot 2022-12-06 202152


I imagine that I'm missing or doing something wildly incorrect! Please correct me if so.
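As a rough way to narrow this down, one could check whether the image objects made it into the exported file at all, or whether they are present but no longer drawn. The sketch below does not use borb; it only scans the raw bytes of both files for the `/Subtype /Image` entry that image XObject dictionaries carry. The helper names `count_image_markers` and `compare_round_trip` are made up for illustration, and the count is a heuristic (it could in principle match the same bytes inside binary stream data), not a substitute for a real PDF parser.

```python
import re
from pathlib import Path
from typing import Tuple

# Image XObject dictionaries contain "/Subtype /Image"; whitespace between
# the two names is optional in PDF syntax, hence the \s*.
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")


def count_image_markers(data: bytes) -> int:
    """Count occurrences of the image XObject marker in raw PDF bytes (heuristic)."""
    return len(IMAGE_MARKER.findall(data))


def compare_round_trip(original_path: str, exported_path: str) -> Tuple[int, int]:
    """Return (markers in original, markers in exported) for two PDF files."""
    before = count_image_markers(Path(original_path).read_bytes())
    after = count_image_markers(Path(exported_path).read_bytes())
    return before, after
```

Calling `compare_round_trip("fleur-dining-menu-210220.pdf", "fleur_export.pdf")` after running the script above would give the two counts. A lower count for the export would suggest the image objects were never written; equal counts would suggest they were written but are no longer referenced or drawn by the page content.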

jorisschellekens commented 1 year ago

You are not doing anything wrong. However, Canva, the producer of this file, is. There is a validator for PDF files online. You can find it here.

When I run it against your input PDF, it provides the following errors (taking only those related to colors and images):

The second one is not as severe as it sounds. Essentially, in order to be a PDF/A (archivable) document, a file needs to embed a color profile (so that readers can calibrate themselves). This is only a requirement for archivable PDF documents.
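For illustration, here is a minimal sketch of how one might check for that kind of embedded color information without a validator. In PDF, an output color profile is attached through an `/OutputIntents` entry in the document catalog, so the sketch simply looks for that key in the raw bytes. The function name `has_output_intent_marker` is hypothetical, and the check is only a heuristic: in files that store the catalog inside a compressed object stream the key will not be visible this way, so a `False` result is not conclusive.

```python
import re

# The document catalog references output color profiles via /OutputIntents.
OUTPUT_INTENT_MARKER = re.compile(rb"/OutputIntents\b")


def has_output_intent_marker(data: bytes) -> bool:
    """Return True if the raw PDF bytes contain an /OutputIntents key (heuristic)."""
    return OUTPUT_INTENT_MARKER.search(data) is not None
```

It could be applied to the bytes of the input file, for example `has_output_intent_marker(open("fleur-dining-menu-210220.pdf", "rb").read())`.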

The first warning, however, is one I have not seen before. I'll have a look.

Interesting problem! Thank you!

a-ndrewang commented 1 year ago

There is a validator for PDF files online. You can find it here.

Thanks for sharing this.

I have run it against another PDF that has the same graphical issue. Of note, this PDF does not have the same validation errors as the fleur menu input PDF, but it does show the same 'missing graphic' issue.

As before, left is the original, and right is the result after loading and dumping from borb.

image

This input PDF returned these errors using veraPDF:

I wonder whether there is something else going on?