Closed Max-ES closed 1 year ago
This is sad indeed. Please do provide some reproducing file, or otherwise we won't be able to accept the report.
Other missing information pieces pertain to you platform and PyMuPDF version. Please make sure to provide all this.
Thanks for the quick response. I updated the description. I will have to check if there is a way to provide you with the documents - this might take some time.
Thanks for the quick response. I updated the description. I will have to check if there is a way to provide you with the documents - this might take some time.
You can use my private e-mail address. This way there will be no public visibility.
I just received the example file - thanks a lot.
I did a few tests and found that this probably is no bug. Instead, a few precautions should be taken to ensure properly created pixmaps in this situation.
A few preliminary comments on what is happening on Document.save()
:
The base library maintains a cache in memory to improve performance. Large objects are stored there - first and foremost images and pixmaps.
Depending on the save options, the Document
in memory is being changed. For any positive garbage value, PDF objects may be deleted or be updated. The four positive garbage levels are like "onion skins": a larger value implies all actions of the lower levels.
With this in mind, the observed behavior can be explained. When you used pdf[0]
instead of assigning a persistent Page
object for taking the Pixmap
, you already took half the way to prevent the worst. When reusing the same Page
object before and after the save
, a page = pdf.reload_page(page)
would do the same job.
So wrapping up, the following sequence of instructions should prevent the problem:
pix1 = page.get_pixmap()
doc.save(... with garbage>0 or other changes ...)
# the following instructions will make sure that
# the pixmap is made from updated data.
fitz.TOOLS.store_shrink(100) # clears the internal cache
page = doc.reload_page(page) # reload the page preserving dependent objects
pix2 = page.get_pixmap()
So as a general rule, you should
Document.reload_page()
, such that this instruction will be sufficient.Thank you for your work! I understand, that a garbage value > 0 changes the loaded page by optimizing it.
But the mixing up of the images in the page still seems like a bug to me. When the xref numbers are regenerated with garbage=3, I would expect the pixmap to look the same, without reloading the page, especially because the saved document looks fine. (The saved document gets optimized and the loaded page gets 'mangled' in the process.)
If I understood it correctly the problems are the cached images that are renumbered during saving and falsely referenced after that which leads to the switched appearances.
Updating the cached images with the new indices during saving would fix the issue, wouldn't it?
Updating the cached images with the new indices during saving would fix the issue, wouldn't it?
May well be. But that is not what a cache is for: its sense in life is to save work, not the opposite. The adequate way of doing this is (1) empty the cache, (2) reload the page. If then another Pixmap of the page is required, it will automatically be based on current page contents and not on invalidated cache pieces.
Fixed in 1.23.5.
Problem
After saving the fitz.Document with
pdf.save("filename.pdf", garbage=3)
the embedded images of the document I can see in the pixmap of the pages are mixed up:pixmap = pdf[0].get_pixmap()
.fitz.Document.save(garbage=3)
andfitz.Document.write(garbage=3)
appear to have side effects that change the document in place. When I setgarbage
to 2 it works as expected.Observation
Checked checkboxes become unchecked boxes and the other way around, some checkboxes become logos etc..
Code Example
Sadly I can't provide the affected internal documents.
Version
Installed with
pip install PyMuPdf
I also tested it on 1.22.5 on macOS with the same result.