@osvenskan Well, everything seems to be OK with the indirect object 4 0:
>>> from pdfreader.parsers.document import PDFParser
>>> s=b'''4 0 obj
... <<
... /ProcSet [/PDF /Text /ImageB /ImageC]
... /ColorSpace <</CsMob1 [/Pattern /DeviceRGB]>>
... /Pattern << /PatMob4_R0 5 0 R >>
... /Font <</1=C0C000A0 6 0 R /2=C0C00080 7 0 R /3=C0C000G0 8 0 R /5=C0C008C0 9 0 R /6=C0C000I0 10 0 R /7=C0C000K0 11 0 R >>
... /XObject <</F1A42E9C 12 0 R >>
... >>
... endobj'''
>>> PDFParser(s, 0).indirect_object()
<IndirectObject:n=4,g=0 ... >
The issue is that the next object, 5 0, can't be parsed because it has no closing endobj keyword.
Is this the end of the file? If not, can you share the next object (probably 6 0)?
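For anyone who wants to check their own file for the same problem, here is a rough sketch that lists objects with no endobj before the next object header. It is not part of pdfreader: the name find_unterminated_objects is made up for illustration, and the scan is a line-based heuristic over raw bytes that binary stream data or bare CR line endings could fool.

```python
import re

# Heuristic: an object header is "<num> <gen> obj" at the start of a line.
OBJ_HEADER = re.compile(rb"(?m)^(\d+)\s+(\d+)\s+obj\b")


def find_unterminated_objects(data: bytes):
    """Return (num, gen) for each object with no `endobj` before the next header.

    Hypothetical helper for diagnosis only; it does not parse the PDF.
    """
    headers = list(OBJ_HEADER.finditer(data))
    missing = []
    for i, match in enumerate(headers):
        # The object's body runs up to the next header, or to end of data.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(data)
        if b"endobj" not in data[match.end():end]:
            missing.append((int(match.group(1)), int(match.group(2))))
    return missing


# Reduced version of the snippet in this thread: 5 0 has no endobj.
sample = b"""4 0 obj
<< /Pattern << /PatMob4_R0 5 0 R >> >>
endobj
5 0 obj
<< /Length 1 >>
stream
f
endstream
6 0 obj
<< /Type /Font >>
endobj
"""

print(find_unterminated_objects(sample))
```

On a real file you would read the bytes with open(path, "rb").read() and pass them in.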
Sure, here's a larger snippet of the PDF, starting at the same place as the previous snippet but including what looks like the endobj for 6 0.
4 0 obj
<<
/ProcSet [/PDF /Text /ImageB /ImageC]
/ColorSpace <</CsMob1 [/Pattern /DeviceRGB]>>
/Pattern << /PatMob4_R0 5 0 R >>
/Font <</1=C0C000A0 6 0 R /2=C0C00080 7 0 R /3=C0C000G0 8 0 R /5=C0C008C0 9 0 R /6=C0C000I0 10 0 R /7=C0C000K0 11 0 R >>
/XObject <</F1A42E9C 12 0 R >>
>>
endobj
5 0 obj
<<
/Type /Pattern
/PatternType 1
/PaintType 2
/TilingType 1
/Resources << >>
/BBox [0 0 100 100]
/XStep 100
/YStep 100
/Matrix [0.013 0.013 -0.013 0.013 0 0]
/Length 179
>>
stream
1 w 70.00 50.00 m
70.00 61.04 61.04 70.00 50.00 70.00 c
38.96 70.00 30.00 61.04 30.00 50.00 c
30.00 38.96 38.96 30.00 50.00 30.00 c
61.04 30.00 70.00 38.96 70.00 50.00 c
f
endstream
6 0 obj
<<
/Type /Font
/Subtype /Type1
/Name /1=C0C000A0
/Encoding /WinAnsiEncoding
/BaseFont /Helvetica
>>
endobj
@osvenskan Well, technically this is an issue in the file, as every indirect object must be closed by endobj. Here object 5 0 ends at endstream with no endobj after it. Despite that, the stream object is still parsable, so we can tolerate the missing endobj here.
I made a few changes on the issue-64-allow-streams-without-endobj branch. Can you test it with your file and give me feedback? I want to be sure this change works before merging into master.
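To illustrate the general idea of tolerating a missing endobj after a stream, here is a minimal sketch. It is not the code on the branch: consume_optional_endobj is a hypothetical helper, and it assumes the caller has already read up to the end of the endstream keyword.

```python
# PDF whitespace characters: NUL, tab, LF, form feed, CR, space.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def consume_optional_endobj(data: bytes, pos: int):
    """Skip whitespace after `endstream`, then consume `endobj` if it is there.

    Returns (new_pos, found). When `endobj` is absent, the position is left at
    the next token so parsing can continue with the following object.
    Hypothetical helper, assumed for illustration only.
    """
    while pos < len(data) and data[pos] in PDF_WHITESPACE:
        pos += 1
    if data.startswith(b"endobj", pos):
        return pos + len(b"endobj"), True
    return pos, False


strict = b"endstream\nendobj\n6 0 obj"
lenient = b"endstream\n6 0 obj"
after_endstream = len(b"endstream")

print(consume_optional_endobj(strict, after_endstream))
print(consume_optional_endobj(lenient, after_endstream))
```

A stricter parser would raise when found is False. Returning a flag instead lets the caller log a warning and carry on.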
Dyakuyu, that works! I tested it with several other PDFs (all from the same source) that were throwing the error, and all parse quietly now. Sorry that my bank is producing non-standard PDFs. 🙂
I opened a PDF with this code --
And got this traceback --
The PDF is from my bank, so I can't share it. :-/ But I can share this snippet of the PDF, which I think is what it's trying to parse when it raises the exception. Using pdb, I can see that on line 89 of types.DictBasedObject.__getitem__(), item == 'PatMob4_R0' when the exception occurs.