maxpmaxp / pdfreader

Python API for PDF documents
MIT License

ERROR:root:!!!Failed to locate... exception #64

Closed: osvenskan closed this issue 3 years ago

osvenskan commented 3 years ago

I opened a PDF with this code --

from pdfreader import SimplePDFViewer

with open(filepath, "rb") as f:
    viewer = SimplePDFViewer(f)

And got this traceback --

ERROR:root:!!!Failed to locate 5 0: assuming null
Traceback (most recent call last):
  File "/Users/me/.virtualenvs/stock_transaction_reader/lib/python3.9/site-packages/pdfreader/parsers/document.py", line 482, in locate_object
    _ = self.next_brute_force_object()
  File "/Users/me/.virtualenvs/stock_transaction_reader/lib/python3.9/site-packages/pdfreader/parsers/document.py", line 512, in next_brute_force_object
    obj = self.body_element() # can be either indirect object, startxref or trailer
  File "/Users/me/.virtualenvs/stock_transaction_reader/lib/python3.9/site-packages/pdfreader/parsers/document.py", line 261, in body_element
    obj = self.indirect_object()
  File "/Users/me/.virtualenvs/stock_transaction_reader/lib/python3.9/site-packages/pdfreader/parsers/document.py", line 544, in indirect_object
    obj = super(RegistryPDFParser, self).indirect_object()
  File "/Users/me/.virtualenvs/stock_transaction_reader/lib/python3.9/site-packages/pdfreader/parsers/document.py", line 65, in indirect_object
    self.on_parser_error("endobj expected")
  File "/Users/me/.virtualenvs/stock_transaction_reader/lib/python3.9/site-packages/pdfreader/parsers/base.py", line 48, in on_parser_error
    raise self.exception_class(message)
pdfreader.exceptions.ParserException: endobj expected

The PDF is from my bank, so I can't share it. :-/ But I can share this snippet of the PDF, which I think is what it's trying to parse when it raises the exception.

4 0 obj
<<
/ProcSet [/PDF /Text /ImageB /ImageC]
/ColorSpace <</CsMob1 [/Pattern /DeviceRGB]>>
/Pattern << /PatMob4_R0 5 0 R >>
/Font <</1=C0C000A0 6 0 R /2=C0C00080 7 0 R /3=C0C000G0 8 0 R /5=C0C008C0 9 0 R /6=C0C000I0 10 0 R /7=C0C000K0 11 0 R >>
/XObject <</F1A42E9C 12 0 R >>
>>
endobj
5 0 obj
<<
/Type /Pattern
/PatternType 1
/PaintType 2
/TilingType 1
/Resources << >>
/BBox [0 0 100 100]
/XStep 100
/YStep 100
/Matrix [0.013 0.013 -0.013 0.013 0 0]
/Length 179
>>

Using pdb, I can see that on line 89 of types.DictBasedObject.__getitem__(), item == 'PatMob4_R0' when the exception occurs.

maxpmaxp commented 3 years ago

@osvenskan Well, everything seems to be ok with the indirect object 4 0:

>>> from pdfreader.parsers.document import PDFParser
>>> s=b'''4 0 obj
... <<
... /ProcSet [/PDF /Text /ImageB /ImageC]
... /ColorSpace <</CsMob1 [/Pattern /DeviceRGB]>>
... /Pattern << /PatMob4_R0 5 0 R >>
... /Font <</1=C0C000A0 6 0 R /2=C0C00080 7 0 R /3=C0C000G0 8 0 R /5=C0C008C0 9 0 R /6=C0C000I0 10 0 R /7=C0C000K0 11 0 R >>
... /XObject <</F1A42E9C 12 0 R >>
... >>
... endobj'''
>>> PDFParser(s, 0).indirect_object()
<IndirectObject:n=4,g=0 ... >

The issue is that the next object, 5 0, can't be parsed because it has no closing endobj. Is this the end of the file? If not, can you share the next object (probably 6 0)?
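To check a file for this condition without sharing it, a rough diagnostic along these lines may help. It is a stdlib-only sketch and not part of pdfreader; the name objects_missing_endobj is made up for illustration, and the regex scan is a heuristic that binary stream data could fool.

```python
import re
from typing import List, Tuple

# An indirect-object header such as b"5 0 obj" at the start of a line.
# The lookbehind accepts both \n and \r line endings.
OBJ_HEADER = re.compile(rb"(?:^|(?<=[\r\n]))(\d+)\s+(\d+)\s+obj\b")


def objects_missing_endobj(data: bytes) -> List[Tuple[int, int]]:
    """Return (number, generation) for each object whose body, up to the
    next object header or the end of the data, has no endobj keyword."""
    headers = list(OBJ_HEADER.finditer(data))
    missing = []
    for i, match in enumerate(headers):
        # The body runs until the next header, or to the end of the data.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(data)
        if b"endobj" not in data[match.end():end]:
            missing.append((int(match.group(1)), int(match.group(2))))
    return missing
```

Reading the file with open(filepath, "rb").read() and passing the bytes to this function lists the objects worth inspecting by hand.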

osvenskan commented 3 years ago

Sure, here's a larger snippet of the PDF, starting at the same place as the previous snippet but including what looks like the endobj for 6 0.

4 0 obj
<<
/ProcSet [/PDF /Text /ImageB /ImageC]
/ColorSpace <</CsMob1 [/Pattern /DeviceRGB]>>
/Pattern << /PatMob4_R0 5 0 R >>
/Font <</1=C0C000A0 6 0 R /2=C0C00080 7 0 R /3=C0C000G0 8 0 R /5=C0C008C0 9 0 R /6=C0C000I0 10 0 R /7=C0C000K0 11 0 R >>
/XObject <</F1A42E9C 12 0 R >>
>>
endobj
5 0 obj
<<
/Type /Pattern
/PatternType 1
/PaintType 2
/TilingType 1
/Resources << >>
/BBox [0 0 100 100]
/XStep 100
/YStep 100
/Matrix [0.013 0.013 -0.013 0.013 0 0]
/Length 179
>>
stream
1 w 70.00 50.00 m 
70.00 61.04 61.04 70.00 50.00 70.00 c
38.96 70.00 30.00 61.04 30.00 50.00 c
30.00 38.96 38.96 30.00 50.00 30.00 c
61.04 30.00 70.00 38.96 70.00 50.00 c
f
endstream
6 0 obj
<<
/Type /Font
/Subtype /Type1
/Name /1=C0C000A0
/Encoding /WinAnsiEncoding
/BaseFont /Helvetica
>>
endobj

maxpmaxp commented 3 years ago

@osvenskan Well, technically this is an issue in the file, as every indirect object must be closed with endobj, and object 5 0 ends at endstream without one. Despite this, the stream object is still parsable, so we can tolerate the missing endobj here.
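For illustration only, the same leniency can be pictured as a pre-processing step outside the library: wherever endstream is followed directly by the next object header, treat the object as finished. The sketch below is a hypothetical helper (add_missing_endobj is not a pdfreader function and is not the change made in the PR). Note that inserting bytes shifts every later offset, so the file's cross-reference table would no longer match the repaired data.

```python
import re

# "endstream", then whitespace, then the next object header ("6 0 obj")
# where the endobj keyword should have appeared.
MISSING_ENDOBJ = re.compile(rb"(endstream)(\s+)(?=\d+\s+\d+\s+obj\b)")


def add_missing_endobj(data: bytes) -> bytes:
    """Insert endobj after any endstream that runs straight into the
    next indirect-object header. Objects already closed are untouched."""
    return MISSING_ENDOBJ.sub(rb"\1\nendobj\2", data)
```

A missing endobj on the last object before xref or trailer is not covered by this pattern.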

I made a few changes on the issue-64-allow-streams-without-endobj branch. Can you test it with your file and give me feedback? I want to be sure this change works before merging into master.

PR: https://github.com/maxpmaxp/pdfreader/pull/66

osvenskan commented 3 years ago

Dyakuyu, that works! I tested it with several other PDFs (all from the same source) that were throwing the error, and all parse quietly now. Sorry that my bank is producing non-standard PDFs. 🙂
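A batch check like the one described could be scripted roughly as follows. This is a sketch under assumptions: check_pdfs is a made-up helper, and it relies on the parser reporting problems on the root logger, as the "ERROR:root:" prefix in the original traceback suggests. The open_pdf parameter exists only so the helper can be exercised without pdfreader installed.

```python
import logging


class _ErrorCollector(logging.Handler):
    """Collects ERROR-and-above messages emitted while a PDF is opened."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def check_pdfs(paths, open_pdf=None):
    """Open each path and return {path: [problems]} for every file that
    raised an exception or logged an error. An empty dict means quiet."""
    if open_pdf is None:
        # Assumption: pdfreader is installed. SimplePDFViewer is the class
        # used in the original report.
        from pdfreader import SimplePDFViewer as open_pdf
    root = logging.getLogger()
    problems = {}
    for path in paths:
        collector = _ErrorCollector()
        root.addHandler(collector)
        try:
            with open(path, "rb") as f:
                open_pdf(f)
        except Exception as exc:
            collector.messages.append(repr(exc))
        finally:
            root.removeHandler(collector)
        if collector.messages:
            problems[str(path)] = collector.messages
    return problems
```

Calling check_pdfs with a list of file paths before and after switching branches shows which files stopped reporting errors.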