pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

`page.annots()`: ValueError: orphaned object: parent is None #2479

Closed cbm755 closed 1 year ago

cbm755 commented 1 year ago

I'm not sure what the expected behaviour is, but I'm getting a ValueError from .annots():

>> import fitz
>> fitz.version
Out[3]: ('1.22.3', '1.22.0', '20230510000001')

>> doc = fitz.open("DELETE_ME_tam_file_with_runtime_annot_errors.pdf")

>> list(doc[0].annots())
Out[8]: ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /usr/lib/python3.11/site-packages/IPython/core/formatters.py:706, in PlainTextFormatter.__call__(self, obj)
    699 stream = StringIO()
    700 printer = pretty.RepresentationPrinter(stream, self.verbose,
    701     self.max_width, self.newline,
    702     max_seq_length=self.max_seq_length,
    703     singleton_pprinters=self.singleton_printers,
    704     type_pprinters=self.type_printers,
    705     deferred_pprinters=self.deferred_printers)
--> 706 printer.pretty(obj)
    707 printer.flush()
    708 return stream.getvalue()

File /usr/lib/python3.11/site-packages/IPython/lib/pretty.py:393, in RepresentationPrinter.pretty(self, obj)
    390 for cls in _get_mro(obj_class):
    391     if cls in self.type_pprinters:
    392         # printer registered in self.type_pprinters
--> 393         return self.type_pprinters[cls](obj, self, cycle)
    394     else:
    395         # deferred printer
    396         printer = self._in_deferred_types(cls)

File /usr/lib/python3.11/site-packages/IPython/lib/pretty.py:640, in _seq_pprinter_factory.<locals>.inner(obj, p, cycle)
    638         p.text(',')
    639         p.breakable()
--> 640     p.pretty(x)
    641 if len(obj) == 1 and isinstance(obj, tuple):
    642     # Special case for 1-item tuples.
    643     p.text(',')

File /usr/lib/python3.11/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
    407                         return meth(obj, self, cycle)
    408                 if cls is not object \
    409                         and callable(cls.__dict__.get('__repr__')):
--> 410                     return _repr_pprint(obj, self, cycle)
    412     return _default_pprint(obj, self, cycle)
    413 finally:

File /usr/lib/python3.11/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
    776 """A pprint that just redirects to the normal repr function."""
    777 # Find newlines and replace them with p.break_()
--> 778 output = repr(obj)
    779 lines = output.splitlines()
    780 with p.group():

File ~/.local/lib/python3.11/site-packages/fitz/fitz.py:8526, in Annot.__repr__(self)
   8525 def __repr__(self):
-> 8526     CheckParent(self)
   8527     return "'%s' annotation on %s" % (self.type[1], str(self.parent))

File ~/.local/lib/python3.11/site-packages/fitz/fitz.py:2975, in CheckParent(o)
   2973 def CheckParent(o: typing.Any):
   2974     if getattr(o, "parent", None) == None:
-> 2975         raise ValueError("orphaned object: parent is None")

ValueError: orphaned object: parent is None

I don't get the error if I pass `types`:

>> list(doc[0].annots(types=[fitz.PDF_ANNOT_WIDGET]))
Out[36]: []

>> list(doc[0].annots(types=[fitz.PDF_ANNOT_POPUP]))
Out[37]: []

>> list(doc[0].annots(types=[fitz.PDF_ANNOT_LINK]))
Out[38]: []

I also don't get the error if I reference a page directly:

>> p = doc[0]

>> list(p.annots())
Out[42]: 
['FreeText' annotation on page 0 of DELETE_ME_tam_file_with_runtime_annot_errors.pdf,
 'FreeText' annotation on page 0 of DELETE_ME_tam_file_with_runtime_annot_errors.pdf,
 'FreeText' annotation on page 0 of DELETE_ME_tam_file_with_runtime_annot_errors.pdf,
 'FreeText' annotation on page 0 of DELETE_ME_tam_file_with_runtime_annot_errors.pdf,
 'FreeText' annotation on page 0 of DELETE_ME_tam_file_with_runtime_annot_errors.pdf,
 'FreeText' annotation on page 0 of DELETE_ME_tam_file_with_runtime_annot_errors.pdf,
 'FreeText' annotation on page 0 of DELETE_ME_tam_file_with_runtime_annot_errors.pdf,
 'FreeText' annotation on page 0 of DELETE_ME_tam_file_with_runtime_annot_errors.pdf,
 'FreeText' annotation on page 0 of DELETE_ME_tam_file_with_runtime_annot_errors.pdf,
 'FreeText' annotation on page 0 of DELETE_ME_tam_file_with_runtime_annot_errors.pdf,
 'Ink' annotation on page 0 of DELETE_ME_tam_file_with_runtime_annot_errors.pdf,
 'Ink' annotation on page 0 of DELETE_ME_tam_file_with_runtime_annot_errors.pdf,
 'FreeText' annotation on page 0 of DELETE_ME_tam_file_with_runtime_annot_errors.pdf,
 'FreeText' annotation on page 0 of DELETE_ME_tam_file_with_runtime_annot_errors.pdf,
 'FreeText' annotation on page 0 of DELETE_ME_tam_file_with_runtime_annot_errors.pdf]

I can't think why p.annots() would behave differently from doc[0].annots()...

cbm755 commented 1 year ago

Sorry, I cannot share this PDF file, but it's not a totally healthy file:

$ pdfinfo DELETE_ME_tam_file_with_runtime_annot_errors.pdf 
Creator:         BaKoMa TeX 11.80 29518P1573/366481723
Producer:        Lahore University of Management Sciences (LUMS), Lahore, Pakistan
CreationDate:    Wed Jun  7 12:07:10 2023 PDT
Custom Metadata: no
Metadata Stream: no
Tagged:          no
UserProperties:  no
Suspects:        no
Form:            AcroForm
Syntax Error: Can't get Fields array<0a>
JavaScript:      no
Pages:           24
Encrypted:       no
Page size:       612 x 792 pts (letter)
Page rot:        0
File size:       1713255 bytes
Optimized:       no
PDF version:     1.4

cbm755 commented 1 year ago

Somewhat surprisingly:

In [10]: doc.is_repaired
Out[10]: False

The first few lines look like this:

%PDF-1.4^M
%¡³Å×^M
1 0 obj^M
[/PDF/Text/ImageB/ImageC]^M
endobj^M
2 0 obj^M
<</AcroForm 174 0 R /Pages 4 0 R /Type/Catalog>>^M
endobj^M
3 0 obj^M
<</CreationDate(D:20230607190710)/Creator(BaKoMa TeX 11.80 29518P1573/366481723)/Producer(Lahore University of Management Sciences \(LUMS\), Lahore, Pakistan)>>^M
endobj^M
4 0 obj^M
<</Count 24/Kids[ 5 0 R  6 0 R  7 0 R  8 0 R  9 0 R  10 0 R  11 0 R  12 0 R  13 0 R  14 0 R  15 0 R  16 0 R  17 0 R  18 0 R  19 0 R  20 0 R  21 0 R  22 0 R  23 0 R  24 0 R  25 0 R  26 0 R  27 0 R  28 0 R ]/MediaBox[ 0 0 612 792]/Type/Pages>>^M
endobj^M
5 0 obj^M
<</Annots[ 176 0 R  178 0 R  181 0 R  183 0 R  185 0 R  187 0 R  189 0 R  191 0 R  193 0 R  195 0 R  198 0 R  200 0 R  201 0 R  203 0 R  205 0 R ]/Contents 34 0 R /Parent 4 0 R /ProcSet 1 0 R /Resources 29 0 R /Type/Page>>^M
endobj^M
6 0 obj^M
<</Annots[ 207 0 R ]/Contents 46 0 R /Parent 4 0 R /ProcSet 1 0 R /Resources 36 0 R /Type/Page>>^M
endobj^M
7 0 obj^M
<</Annots[ 222 0 R ]/Contents 49 0 R /Parent 4 0 R /ProcSet 1 0 R /Resources 48 0 R /Type/Page>>^M
endobj^M
8 0 obj^M
<</Annots[ 224 0 R  2335 0 R ]/Contents 52 0 R /Parent 4 0 R /ProcSet 1 0 R /Resources 51 0 R /Type/Page>>^M
endobj^M
9 0 obj^M

There are many more `<</Annots` entries, but I'm nervous about posting more of the file's content.

Let me know if you need any other info from the file...

JorjMcKie commented 1 year ago

That's a lot of posts. There is no bug, and I think I can address what seems surprising at first sight. Please let me transfer this to "Discussions" to have more room for details.
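
The explanation itself is not given in this thread. One plausible reading, stated here as an assumption, is that `doc[0]` creates a temporary page object that is discarded as soon as `list(...)` returns, while `p = doc[0]` keeps the page alive, and that annotations whose page has gone away fail the `CheckParent` test shown in the traceback. The sketch below is a toy model of that lifetime effect in plain Python. `ToyPage`, `ToyAnnot` and `open_page` are hypothetical names, not PyMuPDF classes, and the model relies on CPython's reference counting.

```python
import weakref


class ToyAnnot:
    """Hypothetical stand-in for an annotation; not the PyMuPDF class."""

    def __init__(self, page, kind):
        self.kind = kind
        # Assumption: the annotation only holds a weak link to its page.
        self.parent = weakref.proxy(page)

    def __repr__(self):
        # Mirrors the CheckParent test from the traceback in this issue.
        if getattr(self, "parent", None) is None:
            raise ValueError("orphaned object: parent is None")
        return f"'{self.kind}' annotation on {self.parent.label}"


class ToyPage:
    """Hypothetical stand-in for a page that orphans its annotations."""

    def __init__(self, label, kinds):
        self.label = label
        self._kinds = kinds
        self._issued = []

    def annots(self):
        for kind in self._kinds:
            annot = ToyAnnot(self, kind)
            self._issued.append(annot)
            yield annot

    def __del__(self):
        # Assumption: a page that goes away clears the parent of its annotations.
        for annot in self._issued:
            annot.parent = None


def open_page():
    # Plays the role of doc[0]: every call returns a fresh page object.
    return ToyPage("page 0", ["FreeText", "Ink"])


# Temporary page: nothing references it once list() has finished.
orphans = list(open_page().annots())
try:
    repr(orphans)
    outcome = "ok"
except ValueError as exc:
    outcome = str(exc)

# Named page: the reference keeps the parent alive while repr() runs.
page = open_page()
kept = list(page.annots())
kept_text = repr(kept)
```

In this model `outcome` ends up holding the same message as the traceback above, while `kept_text` lists both annotations. The `types=[...]` calls in the report returned empty lists, so no annotation `repr` ran there at all, which would be consistent with this reading.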