jesparza / peepdf

Powerful Python tool to analyze PDF documents
http://peepdf.eternal-todo.com
GNU General Public License v3.0

Parsing error when an object/stream is placed after %%EOF #74

Open binjo opened 6 years ago

binjo commented 6 years ago

It appears Acrobat will render PDF files properly even when an object/stream is defined after %%EOF. peepdf, however, discards that content because it stops parsing at %%EOF.

For example, the recent high-profile PDF exploit, bd23ad33accef14684d42c32769092a0:

0000023515 00000 n
0000024187 00000 n
0000024261 00000 n
trailer
<<
 /Size 67
 /Root 10 0 R
>>
startxref
24613
%%EOF

1 0 obj 
<<
 /Length 56305 
 /Filter /FlateDecode 
 >> 
 stream
....

The current peepdf fails to parse this file and throws an exception.
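A quick way to check whether a sample has this layout is to look for object headers after the last %%EOF marker. The following is only a rough Python 3 sketch and not part of peepdf; the function name `objects_after_last_eof` is hypothetical, and a plain regex scan can report false positives when compressed stream bytes happen to look like an object header.

```python
import re

# Matches an indirect object header such as "1 0 obj".
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")

EOF_MARKER = b"%%EOF"


def objects_after_last_eof(data):
    """Return (object id, generation) pairs found after the final %%EOF.

    Returns an empty list when the file has no %%EOF marker at all.
    """
    idx = data.rfind(EOF_MARKER)
    if idx == -1:
        return []
    tail = data[idx + len(EOF_MARKER):]
    return [(int(m.group(1)), int(m.group(2)))
            for m in OBJ_HEADER.finditer(tail)]


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(objects_after_last_eof(handle.read()))
```

An empty result means no object headers were found after the final %%EOF; it does not prove the file is clean.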

The following patch tries to fix the problem:

diff --git a/PDFCore.py b/PDFCore.py
index 3b2fe00..33cf5a4 100644
--- a/PDFCore.py
+++ b/PDFCore.py
@@ -4315,7 +4315,7 @@ class PDFBody :
                                 self.setObject(compressedId, compressedObject, offset)
                             del(compressedObjectsDict)
         for id in self.referencedJSObjects:
-            if id not in self.containingJS:
+            if (len(self.containingJS) and id not in self.containingJS):
                 object = self.objects[id].getObject()
                 if object == None:
                     errorMessage = 'Object is None'
@@ -6941,6 +6941,9 @@ class PDFParser :
                     self.fileParts.append(fileContent)
                 else:
                     sys.exit(errorMessage)
+        # append anything behind %%EOF
+        if fileContent:
+            self.fileParts.append(fileContent)
         pdfFile.setUpdates(len(self.fileParts) - 1)

         # Getting the body, cross reference table and trailer of each part of the file
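The idea behind the second hunk can be illustrated outside peepdf with a standalone Python 3 sketch. This is not the code in PDFCore.py: the function name `split_on_eof` is hypothetical, and the real parser does more validation. It only shows the principle of keeping whatever follows the final %%EOF as an extra file part instead of dropping it.

```python
from typing import List

EOF_MARKER = b"%%EOF"


def split_on_eof(data: bytes) -> List[bytes]:
    """Split raw PDF bytes into parts, each ending at an %%EOF marker.

    Non-whitespace bytes left over after the final marker are kept as one
    extra part, mirroring the "append anything behind %%EOF" step.
    """
    parts = []
    start = 0
    while True:
        idx = data.find(EOF_MARKER, start)
        if idx == -1:
            break
        end = idx + len(EOF_MARKER)
        parts.append(data[start:end])
        start = end
    trailing = data[start:]
    if trailing.strip():  # ignore a tail that is only newlines or spaces
        parts.append(trailing)
    return parts
```

With this layout, the object defined after %%EOF ends up in its own part, which matches the extra "Version 1" that the patched peepdf reports for the sample.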

With the change applied, peepdf parses the file without issues:

Version 0:
        Catalog: 10
        Info: No
        Objects (50): [6, 7, 9, 10, 11, 12, 14, 15, 17, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 40, 41, 42, 43, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 61, 62, 63, 64, 65, 66]
                Errors (1): [33]
        Streams (14): [14, 15, 17, 25, 31, 32, 33, 34, 49, 51, 55, 56, 57, 62]
                Encoded (11): [14, 15, 17, 25, 31, 32, 33, 49, 51, 55, 56]
                Decoding errors (1): [33]
        Suspicious elements:
                /AcroForm (1): [10]
                /OpenAction (1): [10]
                /JS (1): [11]
                /JavaScript (1): [11]

Version 1:
        Catalog: No
        Info: No
        Objects (1): [1]
        Streams (1): [1]
                Encoded (1): [1]
        Objects with JS code (1): [1]
PPDF> object 1

<< /Length 56305
/Filter /FlateDecode >>
stream

var dlldata= [0x81ec8b55,0x000498ec,0xf4458900 ....

It's a quick fix; you may want to refactor the logic a bit...

Tigzy commented 6 years ago

That was fast, I was looking for this :) Sample here: https://malshare.io/sample.php?hash=e6b7392fb03ff9ff069a9ec5d4221641. I created a fix and PR for another parsing issue: https://github.com/jesparza/peepdf/pull/75. However, the "hidden" stream still isn't seen because it comes after the %%EOF. Thanks for your code.

jesparza commented 6 years ago

Thanks @binjo! I first want to merge everything from a fork that is currently more active than master. I will try to do this quickly, but I need to do some testing first. It is curious that an isolated object really works with Adobe Reader. I am quite sure I read the whole specification years ago, so either it was not documented or they changed something... :?