pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
5.34k stars 509 forks source link

save(garbage=3) breaks get_pixmap() with side effect #2596

Closed Max-ES closed 1 year ago

Max-ES commented 1 year ago

Problem

After saving the fitz.Document with pdf.save("filename.pdf", garbage=3) the embedded images of the document I can see in the pixmap of the pages are mixed up: pixmap = pdf[0].get_pixmap(). fitz.Document.save(garbage=3) and fitz.Document.write(garbage=3) appear to have side effects that change the document in place. When I set garbage to 2 it works as expected.

Observation

Checked checkboxes become unchecked boxes and the other way around, some checkboxes become logos etc..

Code Example

pdf = fitz.Document("filename.pdf")
# pixmap of the first page looks just fine
pix0 = pdf[0].get_pixmap()

# save with "garbage=3" breaks the pdf in memory (the saved pdf on disk is still fine)
pdf.save("debug.pdf", garbage=3)

# the pixmap is now broken - images in the pixmap are mixed up
pix1 = pdf[0].get_pixmap()  # <---------- PROBLEM

# read the saved pdf again
new_pdf = fitz.Document("debug.pdf")
# the pixmap of the saved and read pdf file is not broken!
pix2 = new_pdf[0].get_pixmap()

Sadly I can't provide the affected internal documents.

Version

PyMuPDF 1.20.2: Python bindings for the MuPDF 1.20.3 library.
Version date: 2022-08-13 00:00:01.
Built for Python 3.9 on darwin (64-bit).

Installed with pip install PyMuPdf I also tested it on 1.22.5 on macOS with the same result.

JorjMcKie commented 1 year ago

This is sad indeed. Please do provide some reproducing file, or otherwise we won't be able to accept the report.

JorjMcKie commented 1 year ago

Other missing information pieces pertain to you platform and PyMuPDF version. Please make sure to provide all this.

Max-ES commented 1 year ago

Thanks for the quick response. I updated the description. I will have to check if there is a way to provide you with the documents - this might take some time.

JorjMcKie commented 1 year ago

Thanks for the quick response. I updated the description. I will have to check if there is a way to provide you with the documents - this might take some time.

You can use my private e-mail address. This way there will be no public visibility.

JorjMcKie commented 1 year ago

I just received the example file - thanks a lot.

I did a few tests and found that this probably is no bug. Instead, a few precautions should be taken to ensure properly created pixmaps in this situation. A few preliminary comments on what is happening on Document.save():

  1. The base library maintains a cache in memory to improve performance. Large objects are stored there - first and foremost images and pixmaps.

  2. Depending on the save options, the Document in memory is being changed. For any positive garbage value, PDF objects may be deleted or be updated. The four positive garbage levels are like "onion skins": a larger value implies all actions of the lower levels.

    • Garbage 1 deletes unreferenced objects. This is BTW the minimum value to use after redactions have been applied: otherwise, content invalidated by redactions will still be physically present and continue to be potential data leaks.
    • Garbage 2 compacts the XREF table.
    • Garbage 3 will merge objects with equal definitions. That may lead to renumbering all objects in terms of their XREF numbers.
    • Garbage 4 finally also searches for duplicate binary streams in the Document and consolidates them.
  3. With this in mind, the observed behavior can be explained. When you used pdf[0] instead of assigning a persistent Page object for taking the Pixmap, you already took half the way to prevent the worst. When reusing the same Page object before and after the save, a page = pdf.reload_page(page) would do the same job.

    • A complete safeguard would be to also empty the internal cache of the base library.

So wrapping up, the following sequence of instructions should prevent the problem:

pix1 = page.get_pixmap()
doc.save(... with garbage>0 or other changes ...)

# the following instructions will make sure that
# the pixmap is made from updated data.
fitz.TOOLS.store_shrink(100)  # clears the internal cache
page = doc.reload_page(page)  # reload the page preserving dependent objects

pix2 = page.get_pixmap()

So as a general rule, you should

  1. Clear the cache before rendering pages after saving the document.
  2. Reload a page object before rendering after saving the document.
    • In the next version, we will include cache clearing in the Document.reload_page(), such that this instruction will be sufficient.
Max-ES commented 1 year ago

Thank you for your work! I understand, that a garbage value > 0 changes the loaded page by optimizing it.

But the mixing up of the images in the page still seems like a bug to me. When the xref numbers are regenerated with garbage=3, I would expect the pixmap to look the same, without reloading the page, especially because the saved document looks fine. (The saved document gets optimized and the loaded page gets 'mangled' in the process.)

If I understood it correctly the problems are the cached images that are renumbered during saving and falsely referenced after that which leads to the switched appearances.

Updating the cached images with the new indices during saving would fix the issue, wouldn't it?

JorjMcKie commented 1 year ago

Updating the cached images with the new indices during saving would fix the issue, wouldn't it?

May well be. But that is not what a cache is for: its sense in life is to save work, not the opposite. The adequate way of doing this is (1) empty the cache, (2) reload the page. If then another Pixmap of the page is required, it will automatically be based on current page contents and not on invalidated cache pieces.

julian-smith-artifex-com commented 1 year ago

Fixed in 1.23.5.