Closed tovrstra closed 1 day ago
After some trial and error, I figured that scrubbing the source first fixes the problem:
import fitz
src = fitz.open("tinydft.pdf")
src.scrub()
dst = fitz.open()
dst.insert_pdf(src, from_page=0, to_page=0, final=True)
dst.set_metadata({})
dst.del_xml_metadata()
dst.xref_set_key(-1, "ID", "null")
dst.scrub()
dst.save("tmp.pdf", garbage=4, deflate=True, no_new_id=True)
dst.close()
src.close()
I can use this as a workaround, but there is probably still something in PyMuPDF that should be fixed. (?)
This is not a bug: the file is broken. The message says that the object cross reference table has an object entry (xref number) which is not present in the file.
OK, I agree that the PDF is broken.
I'm still confused about the fact that the scrub function cannot fix it when applied to dst, but can fix it when applied to src in the example. What would be the right way to deal with such broken PDFs in general?
If the PDF has this type of error, this is detected only when the respective object is actually referenced. There is no global scan or validity check that verifies a PDF's health or similar. So the error may pop up at any time - even after other work has been done successfully - or never.
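Since there is no built-in global check, a do-it-yourself scan can list the affected entries up front. The following is only a minimal sketch: the helper name find_dangling_xrefs is hypothetical, doc is assumed to be an open PyMuPDF document, and the exception is caught broadly because the exact exception type is not stated in this thread.

```python
def find_dangling_xrefs(doc):
    """Return the xref numbers whose object definition cannot be read.

    `doc` is assumed to be an open PyMuPDF document; only its
    xref_length() and xref_object() methods are used, nothing is modified.
    """
    dangling = []
    # xref 0 is reserved, so object numbers start at 1.
    for xref in range(1, doc.xref_length()):
        try:
            doc.xref_object(xref)  # read the object's source
        except Exception:  # assumption: any error means the entry is unusable
            dangling.append(xref)
    return dangling
```

Calling find_dangling_xrefs(fitz.open("tinydft.pdf")) should then give a list of suspicious xref numbers to inspect before deciding whether a repair is worthwhile.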
You misunderstood the purpose of scrub(): it is not a PDF health checker! It only removes information, mostly for data protection purposes.
In the course of that and depending on the selected options this type of error may be detected ... or not.
If the error pops up, then the PDF might need to be repaired, depending on which xref points into the wild. If the xref belongs to optional content, nothing catastrophic may happen.
To clean a PDF from this error (and if it is worthwhile spending the time), you can revive those xref ghosts yourself using this:
for xref in range(1, doc.xref_length()):
    try:
        doc.xref_object(xref)  # access the xref's object source
    except Exception:  # raised if the xref points to nowhere
        doc.update_object(xref, "<<>>")  # make an empty object for the xref
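The loop above can be wrapped in a small helper that also reports which entries it touched, so the repair does not happen silently. This is a sketch under assumptions: the name repair_dangling_xrefs and its placeholder parameter are hypothetical, and the broad except is used because the thread does not name the exception type.

```python
def repair_dangling_xrefs(doc, placeholder="<<>>"):
    """Give every unreadable xref an empty object and return the repaired numbers.

    `doc` is assumed to be an open PyMuPDF document. `placeholder` is the
    PDF source written for each broken entry (an empty dictionary by default).
    """
    repaired = []
    # xref 0 is reserved, so object numbers start at 1.
    for xref in range(1, doc.xref_length()):
        try:
            doc.xref_object(xref)  # succeeds if the object exists
        except Exception:  # assumption: any error means the xref points nowhere
            doc.update_object(xref, placeholder)
            repaired.append(xref)
    return repaired
```

In the metadata-stripping script, one could try calling repair_dangling_xrefs(src) right after fitz.open("tinydft.pdf") instead of src.scrub(). Whether that is enough for dst.scrub() to succeed on this particular file has not been verified here.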
Description of the bug
The scrub method fails on some documents with the following error message: (I've shortened the local path to my venv.)
How to reproduce the bug
The following script reproduces the error. (This is a simplified version of a script I use to remove metadata from PDFs, to make them binary reproducible. The software generating the PDFs adds all sorts of random stuff, which makes it difficult to track changes.)
This script fails for some input PDFs, such as the attached tinydft.pdf
I can reproduce this bug with many PyMuPDF versions: 1.24.7, 1.24.6, 1.24.5, 1.24.4, 1.24.3, 1.24.2, 1.24.1, 1.24.0 and 1.23.26. The oldest I could test was 1.23.5, for which the error message is slightly different:
(All tests were done on Fedora 40, in a Python 3.12.4 venv.) If you need more info, please let me know.
PyMuPDF version
1.24.7
Operating system
Linux
Python version
3.12