Document.scrub() raises `RuntimeError: code=7: cannot find object in xref ...`

tovrstra commented 3 days ago

Description of the bug

The scrub method fails on some documents with the following error message:

Traceback (most recent call last):
  File ".../debug.py", line 10, in <module>
    dst.scrub()
  File ".../venv/lib/python3.12/site-packages/pymupdf/utils.py", line 4459, in scrub
    if not doc.xref_object(xref):
           ^^^^^^^^^^^^^^^^^^^^^
  File ".../venv/lib/python3.12/site-packages/pymupdf/__init__.py", line 5895, in xref_object
    ret = extra.xref_object( self.this, xref, compressed, ascii)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../venv/lib/python3.12/site-packages/pymupdf/extra.py", line 117, in xref_object
    return _extra.xref_object(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: code=7: cannot find object in xref (40 0 R)

(I've shortened the local path to my venv.)

How to reproduce the bug

The following script reproduces the error. (It is a simplified version of a script I use to strip metadata from PDFs so that they are binary reproducible. The software generating the PDFs adds all sorts of random data, which makes it difficult to track changes.)

import fitz
src = fitz.open("tinydft.pdf")
dst = fitz.open()
dst.insert_pdf(src, from_page=0, to_page=0, final=True)
dst.set_metadata({})
dst.del_xml_metadata()
dst.xref_set_key(-1, "ID", "null")
dst.scrub()
dst.save("tmp.pdf", garbage=4, deflate=True, no_new_id=True)
dst.close()
src.close()

This script fails for some input PDFs, such as the attached tinydft.pdf.

I can reproduce this bug with many PyMuPDF versions: 1.24.7, 1.24.6, 1.24.5, 1.24.4, 1.24.3, 1.24.2, 1.24.1, 1.24.0 and 1.23.26. The oldest I could test was 1.23.5, for which the error message is slightly different:

Traceback (most recent call last):
  File ".../debug.py", line 10, in <module>
    dst.scrub()
  File ".../venv/lib/python3.12/site-packages/fitz/utils.py", line 4276, in scrub
    raise ValueError(msg)
ValueError: bad xref 40 - clean PDF before scrubbing

(All tests were done on Fedora 40, in a Python 3.12.4 venv.) If you need more info, please let me know.

PyMuPDF version

1.24.7

Operating system

Linux

Python version

3.12

tovrstra commented 3 days ago

After some trial and error, I figured out that scrubbing the source first fixes the problem:

import fitz
src = fitz.open("tinydft.pdf")
src.scrub()
dst = fitz.open()
dst.insert_pdf(src, from_page=0, to_page=0, final=True)
dst.set_metadata({})
dst.del_xml_metadata()
dst.xref_set_key(-1, "ID", "null")
dst.scrub()
dst.save("tmp.pdf", garbage=4, deflate=True, no_new_id=True)
dst.close()
src.close()

I can use this as a workaround, but there is probably still something in PyMuPDF that should be fixed. (?)

JorjMcKie commented 2 days ago

This is not a bug: the file is broken. The message says that the object cross reference table has an object entry (xref number) which is not present in the file.

tovrstra commented 2 days ago

OK, I agree that the PDF is broken.

I'm still confused about the fact that the scrub function cannot fix it when applied to dst, but can fix it when applied to src in the example. What would be the right way to deal with such broken PDFs in general?

JorjMcKie commented 1 day ago

If the PDF has this type of error, this is detected only when the respective object is actually referenced. There is no global scan or validity check that verifies a PDF's health or similar. So the error may pop up at any time - even after other work has been done successfully - or never.
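
Since nothing scans the whole xref table up front, a read-only pass over every xref is one way to see in advance which entries would trigger the error. The following is a minimal sketch, not part of PyMuPDF: the helper name `find_dangling_xrefs` is hypothetical, and it only assumes that `doc` offers `xref_length()` and `xref_object()` as used elsewhere in this thread. Judging from the two tracebacks above, a dangling xref either raises an exception or yields an empty result depending on the version, so the sketch treats both as dangling.

```python
def find_dangling_xrefs(doc):
    """Return the xref numbers whose object source cannot be read.

    Hypothetical helper: `doc` is expected to behave like an open
    PyMuPDF document (xref_length() and xref_object() are used).
    The document is not modified.
    """
    dangling = []
    for xref in range(1, doc.xref_length()):
        try:
            source = doc.xref_object(xref)
        except Exception:
            # Newer versions raise, e.g. "cannot find object in xref".
            dangling.append(xref)
            continue
        if not source:
            # Assumption: some versions return an empty result instead.
            dangling.append(xref)
    return dangling
```

An empty list would suggest that no xref points to a missing object; a non-empty one tells you which entries to look at before deciding whether a repair is worthwhile.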

You misunderstood the purpose of scrub(): it is not a PDF health checker! It mostly removes information for data protection purposes. In the course of that, and depending on the selected options, this type of error may be detected ... or not. If the error pops up, the PDF might need to be repaired, depending on which xref points into the wild. If the missing object is optional, nothing catastrophic may happen. To clean a PDF of this error (if it is worth spending the time), you can revive those xref ghosts yourself using this:

for xref in range(1, doc.xref_length()):
    try:
        doc.xref_object(xref)  # try to read the object's source
    except Exception:  # the xref points to nowhere
        doc.update_object(xref, "<<>>")  # make an empty object for the xref
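
The loop above can also be wrapped in a function that reports which xrefs it replaced, so the repair does not happen silently. This is only a sketch under stated assumptions: the name `repair_dangling_xrefs` is hypothetical, and treating an empty result from `xref_object()` as dangling is an inference from the older traceback in this thread, not documented behaviour.

```python
def repair_dangling_xrefs(doc):
    """Replace unreadable xref entries with empty dictionaries.

    Hypothetical helper: `doc` is expected to behave like an open
    PyMuPDF document (xref_length(), xref_object() and
    update_object() are used). Returns the repaired xref numbers.
    """
    repaired = []
    for xref in range(1, doc.xref_length()):
        try:
            readable = bool(doc.xref_object(xref))
        except Exception:
            # The xref points to an object that is not in the file.
            readable = False
        if not readable:
            # Give the xref an empty object so later lookups succeed.
            doc.update_object(xref, "<<>>")
            repaired.append(xref)
    return repaired
```

In the script from the report, this would be called as `repair_dangling_xrefs(dst)` just before `dst.scrub()`. Whether that is sufficient for tinydft.pdf has not been checked here, and any content the missing objects once held stays lost.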
