jesparza / peepdf

Powerful Python tool to analyze PDF documents
http://peepdf.eternal-todo.com
GNU General Public License v3.0

TypeError leads to an unhandled Exception #70

Open SebastianDeiss opened 6 years ago

SebastianDeiss commented 6 years ago

peepdf crashes with an unhandled TypeError when certain PDFs are analyzed in force parsing mode and PDFObjectStream.resolveReferences() is invoked.

Traceback (most recent call last):
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/main.py", line 409, in main
    ret, pdf = pdfParser.parse(fileName, options.isForceMode, options.isLooseMode, options.isManualAnalysis)
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 7098, in parse
    ret = body.updateObjects()
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 4288, in updateObjects
    object.resolveReferences()
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 3253, in resolveReferences
    ret = PDFParser.readObject(objectsSection[offset:])
TypeError: slice indices must be integers or None or have an __index__ method
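For illustration, this class of error can be reproduced outside peepdf. The sketch below assumes the offset arrives as a string (for instance, taken straight from a regular-expression match); the traceback does not say what type it actually has, and `slice_from`, `objects_section`, and the sample data are hypothetical names, not peepdf code.

```python
def slice_from(section, offset):
    """Return section[offset:], accepting the offset as an int or a digit string."""
    # int() is a no-op for ints and converts digit strings such as "18".
    return section[int(offset):]


objects_section = "<< /Type /Page >> 42"
offset = "18"  # assumption: the offset is held as text rather than as an int

try:
    objects_section[offset:]  # slicing with a str index raises TypeError
except TypeError as exc:
    print("TypeError: %s" % exc)

print(slice_from(objects_section, offset))
```

Converting at the point where the offset is read, rather than at every slice, would keep the rest of the code free of type checks.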

If I fix that TypeError by converting offset to an int at PDFCore.py:3243, I get another one:

Traceback (most recent call last):
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/main.py", line 409, in main
    ret, pdf = pdfParser.parse(fileName, options.isForceMode, options.isLooseMode, options.isManualAnalysis)
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 7098, in parse
    ret = body.updateObjects()
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 4288, in updateObjects
    object.resolveReferences()
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 3253, in resolveReferences
    ret = PDFParser.readObject(objectsSection[offset:])
TypeError: unbound method readObject() must be called with PDFParser instance as first argument (got str instance instead)

A possible solution would be to pass the PDFParser object to PDFObjectStream when that instance is created, and then call readObject() on the supplied PDFParser instance.
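A minimal sketch of that idea, using stand-in classes rather than peepdf's real ones: `Parser`, `ObjectStream`, `read_object`, and the toy token parsing are all hypothetical and only show the shape of passing a parser instance in at construction time.

```python
class Parser(object):
    """Hypothetical stand-in for PDFParser; not peepdf's real class."""

    def read_object(self, content):
        # Toy behaviour: treat the first whitespace-delimited token as the object.
        return content.split()[0]


class ObjectStream(object):
    """Hypothetical stand-in for PDFObjectStream."""

    def __init__(self, objects_section, offsets, parser):
        self.objects_section = objects_section
        self.offsets = offsets
        self.parser = parser  # parser instance supplied by the caller

    def resolve_references(self):
        # Call the method on the stored instance, not on the class itself,
        # and convert each offset to int before slicing.
        return [
            self.parser.read_object(self.objects_section[int(offset):])
            for offset in self.offsets
        ]


stream = ObjectStream("true 42 null", ["0", "5", "8"], Parser())
print(stream.resolve_references())
```

Calling the method on an instance avoids the unbound-method error, at the cost of every code path that creates an object stream having to hand over a parser.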

SebastianDeiss commented 6 years ago

@jesparza I could submit a pull request for this issue like https://github.com/jbremer/peepdf/pull/6, which is based on your master.

Jack28 commented 4 years ago

google.com?q=filetype:pdf https://en.fh-westkueste.de/students/his/

These files, created by HIS, also produce an error. Could it be related?

michaelweiser commented 4 years ago

An extended fix for the TypeErrors is now over at jbremer/peepdf#9.

I have also seen some of those HIS-generated PDFs (which originate from Apache FOP 2.3). They did not trigger the TypeErrors; they only ran into the object stream parsing problem caused by PDFParser.readUntilSymbol() resetting the buffer cursor, which is fixed by commit 1 of that PR. (That separate issue actually only exists in jbremer's fork.)