spender-sandbox / cuckoo-modified

Modified edition of cuckoo

PDF parsing too long #297

Open Ryuchen opened 7 years ago

Ryuchen commented 7 years ago

I have some PDF malware samples; when static.py analyzes them, it gets stuck parsing the PDF (the _parse function in static.py) for more than 900 seconds. Can you provide a solution?

mallorybobalice commented 7 years ago

Same as https://github.com/spender-sandbox/cuckoo-modified/issues/54

For now some people are wrapping the static parsing in interruptingcow. We have an upstream report, but it doesn't seem like peepdf cares: https://github.com/jesparza/peepdf/issues/59

Brad was going to merge interruptingcow in or fix it upstream, but it's probably not a priority for anyone? (Personally we see a metric ton of these, and even with a 15s timeout it's doing quite a bit of useless work at high CPU.)
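
For reference, a minimal sketch of that interruptingcow workaround, assuming it wraps the peepdf call inside static.py (the 15-second budget and the helper name are illustrative, not from the repo):

from interruptingcow import timeout
import peepdf.PDFCore

def parse_with_budget(filepath, seconds=15):
    """Parse a PDF, but give up after `seconds` of wall-clock time."""
    parser = peepdf.PDFCore.PDFParser()
    try:
        # interruptingcow raises the exception via SIGALRM inside the
        # running frame, so this only works from the main thread.
        with timeout(seconds, exception=RuntimeError):
            return parser.parse(filepath, forceMode=True,
                                looseMode=True, manualAnalysis=False)
    except RuntimeError:
        return None  # timed out; skip deep parsing for this sample

As noted above, the sample still burns CPU until the alarm fires; the timeout only caps the damage.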

doomedraven commented 7 years ago

Just spoke with jesparza; the main problem is finding time to investigate why it happens and write a fix, so if someone has time, feel free to help him :)

seifreed commented 7 years ago

Do you have any hashes to test that issue with?

doomedraven commented 7 years ago

1e3db20bb77178cabe8e32a47510a027bb38bc585ed02a95052e3965ac9a9b26
2c2c956d74dcc245655a6c56aa052212ac1a933e22e5a41d63afc1aa9d2eccf5
3eaef2ca2c9d29e936919c7c6f8e5614aef6edf8cec6c92008291bafea0388d0
63e0063d43ae9578c328b4683b53c868497dd41c8c112f3365308907ad444a84
2ab11d83ae2cbd12f0f6c30aacad8a8e16df5255646d08e923054b9f521c4b83

you asked for hashes, so fix it now :P

seifreed commented 7 years ago

LOL

Thanks for sharing!

No PR, not a valid reply, eh? XD

Thnx!


Ryuchen commented 7 years ago

I appreciate your help! Thanks, all of you.

jesparza commented 7 years ago

Hi guys,

@doomedraven told me about this issue. I haven't had time lately to take care of peepdf issues like I should, but I will soon (I hope!). Meanwhile, I added some comments and suggestions in https://github.com/jesparza/peepdf/issues/59.

Thanks!

mallorybobalice commented 7 years ago

Sad face

jgajek commented 7 years ago

@spender-sandbox Any reason why an older version of peepdf is being included in this repo? I believe

pip install peepdf

works now and installs release 0.3.2. It might be a good idea to move to a newer version of the code before spending time troubleshooting any performance issues.

SeanKim777 commented 7 years ago

Tested with the latest peepdf v0.3.2, but the result is worse than v0.3 r235 (the version embedded in cuckoo at the moment):

peepdf v0.3 r234 --> 31.347 seconds
peepdf v0.3.2 --> 45.880 seconds

Below is the code I used to test. v0.3.2 was tested under a virtualenv, and for each peepdf version I changed the import line:

import timeit

s = """
from lib.cuckoo.common.peepdf.PDFCore import PDFParser # older version
# from lib.cuckoo.common.peepdf_latest.PDFCore import PDFParser # newest version
parser = PDFParser()
try:
    ret, pdf = parser.parse("/cuckoo/storage/binaries/1e22b46798975d097444c7cfe45f66b6694cffafb358ad53a020460bfa914174", forceMode=True, looseMode=True, manualAnalysis=False)
except Exception as e:
    print str(e)
"""

timeit.Timer(stmt=s).repeat(number=1)
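
A slightly more systematic variant of that harness (a sketch: the sample path is a placeholder and peepdf must be importable) that times both manualAnalysis settings in one run:

import timeit

SAMPLE = "/cuckoo/storage/binaries/<sha256>"  # placeholder: path to a sample

SETUP = "from peepdf.PDFCore import PDFParser"
for manual in (False, True):
    stmt = (
        "parser = PDFParser()\n"
        "ret, pdf = parser.parse(%r, forceMode=True, "
        "looseMode=True, manualAnalysis=%r)" % (SAMPLE, manual)
    )
    # number=1 because a single parse already takes tens of seconds
    best = min(timeit.Timer(stmt=stmt, setup=SETUP).repeat(repeat=3, number=1))
    print "manualAnalysis=%s: %.3f s" % (manual, best)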

mallorybobalice commented 7 years ago

Please wrap that in interruptingcow as above, make it time out, and print the stack trace (traceback) to confirm where in peepdf it happens. Also please cite the file size, and if you can, provide command-line analysis results to help @jesparza narrow it down. Hopefully we can get a couple of examples so we could possibly turn off JS emulation in the csb calls (or expose it as a UI flag and config item). If you can, check the peepdf function params and rerun with JS emulation off, i.e. what -m does. I'd love to help and will try, but I'm not sure personal life circumstances will allow. So any info supplied as per the above may help resolve this, and you seem as keen as I am, @SeanKim777. I have a feeling that unless we help @jesparza and @spender-sandbox on this, it may be a while: I don't think many people run CM as batch jobs and appreciate the impact this has on a steady stream of samples, given time constraints, or perhaps they don't see as many PDFs of this sort. How am I doing at riling everyone up?

mallorybobalice commented 7 years ago

PS: edited, please check back.
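
A sketch of the instrumentation being asked for here, assuming interruptingcow: because it raises the exception inside whatever frame is executing when the alarm fires, the printed traceback names the spot in peepdf where the time is going (the sample path is a placeholder).

import traceback
from interruptingcow import timeout
from peepdf.PDFCore import PDFParser

sample_path = "/path/to/sample.pdf"  # placeholder
parser = PDFParser()
try:
    with timeout(15, exception=RuntimeError):
        ret, pdf = parser.parse(sample_path, forceMode=True,
                                looseMode=True, manualAnalysis=False)
except RuntimeError:
    # The alarm fired mid-parse; this traceback points into the
    # peepdf function that was running when the budget expired.
    traceback.print_exc()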

jgajek commented 7 years ago

If no one has any objections, I will take a look at switching to the packaged version of peepdf. The way the static analysis module interfaces with peepdf also needs a bit of work. There is no point in passing stream objects with random garbage into analyseJS(). On the other hand, JS code can also be contained in PDF string objects, not just streams.
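
A sketch of that gating, reusing the peepdf helpers already quoted in this thread (isJavascript and the five-tuple returned by analyseJS); the same check works whether the content came from a decoded stream or a PDF string object:

from peepdf.JSAnalysis import analyseJS, isJavascript

def extract_js(content):
    """Run JS emulation only on content that plausibly is JavaScript."""
    if not content or not isJavascript(content):
        return None  # don't feed random garbage to the emulator
    jslist, unescaped_bytes, urls_found, errors, ctx = analyseJS(content)
    if errors or not jslist:
        return None
    return jslist[0], urls_found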

Ryuchen commented 7 years ago

This is my code to upgrade to the packaged peepdf and tidy up the module.

class PDF(object):
    """PDF Analysis."""

    def __init__(self, file_path):
        self.file_path = file_path
        self.pdf = None
        self.base_uri = ""

    def _get_obj_val(self, version, obj):
        try:
            if obj.type == "reference":
                return self.pdf.body[version].getObject(obj.id)
        except:
            pass
        return obj

    def _set_base_uri(self):
        try:
            for version in range(self.pdf.updates + 1):
                trailer, streamTrailer = self.pdf.trailer[version]
                if trailer is not None:
                    elem = trailer.dict.getElementByName("/Root")
                    elem = self._get_obj_val(version, elem)
                    elem = elem.getElementByName("/URI")
                    elem = self._get_obj_val(version, elem)
                    elem = elem.getElementByName("/Base")
                    elem = self._get_obj_val(version, elem)
                    self.base_uri = elem.getValue()
        except:
            pass

    def _parse(self, filepath):
        """Parses the PDF for static information. Uses PyV8 from peepdf to
        extract JavaScript from PDF objects.
        @param filepath: Path to file to be analyzed.
        @return: results dict or None.
        """
        # Load the PDF with PDFiD and convert it to JSON for processing
        PDF_data = PDFiD(filepath, False, True)
        PDF_json = PDFiD2JSON(PDF_data, True)
        PDFid_data = json.loads(PDF_json)[0]

        info = {
            "PDF Header": PDFid_data['pdfid']['header'],
            "Total Entropy": PDFid_data['pdfid']['totalEntropy'],
            'Entropy In Streams': PDFid_data['pdfid']['streamEntropy'],
            'Entropy Out Streams': PDFid_data['pdfid']['nonStreamEntropy'],
            'Count %% EOF': PDFid_data['pdfid']['countEof'],
            'Data After EOF': PDFid_data['pdfid']['countChatAfterLastEof']
        }

        # Note, PDFiD doesn't interpret some dates properly, specifically it doesn't
        # seem to be able to properly represent time zones that involve fractions of
        # an hour
        Dates = PDFid_data['pdfid']['dates']['date']

        keywords = {}
        for keyword in PDFid_data['pdfid']['keywords']['keyword']:
            keywords[str(keyword['name'])] = keyword['count']

        result = {}
        PDF_result = result["PDF"] = {}
        PDF_result["Info"] = info
        PDF_result["Dates"] = Dates
        PDF_result["Keywords"] = keywords

        PDF_parser = peepdf.PDFCore.PDFParser()

        ret, self.pdf = PDF_parser.parse(filepath, forceMode=True, looseMode=True, manualAnalysis=False)

        urlset = set()
        annoturiset = set()
        retobjects = []
        metadata = dict()

        self._set_base_uri()

        for i in range(len(self.pdf.body)):
            body = self.pdf.body[i]
            metatmp = self.pdf.getBasicMetadata(i)
            if metatmp:
                metadata = metatmp

            objects = body.objects

            for index in objects:
                oid = objects[index].id
                offset = objects[index].offset
                size = objects[index].size
                details = objects[index].object

                obj_data = {"Object ID": oid, "Offset": offset, "Size": size}
                if details.type == 'stream':
                    decoded_stream = details.decodedStream
                    if not peepdf.JSAnalysis.isJavascript(decoded_stream):
                        continue
                    else:
                        try:
                            jslist, unescapedbytes, urlsfound, errors, ctxdummy = peepdf.JSAnalysis.analyseJS(decoded_stream)
                            jsdata = jslist[0]
                        except Exception as e:
                            continue
                        if len(errors):
                            continue
                        if jsdata is None:
                            continue

                        for url in urlsfound:
                            urlset.add(url)

                        # The following loop is required to "JSONify" the strings returned from PyV8.
                        # As PyV8 returns byte strings, we must parse out bytecode and
                        # replace it with an escape '\'. We can't use encode("string_escape")
                        # as this would mess up the new line representation which is used for
                        # beautifying the javascript code for Django's web interface.
                        ret_data = ""
                        for x in xrange(len(jsdata)):
                            if ord(jsdata[x]) > 127:
                                tmp = "\\x" + str(jsdata[x].encode("hex"))
                            else:
                                tmp = jsdata[x]
                            ret_data += tmp

                        obj_data["Data"] = ret_data
                        retobjects.append(obj_data)
                elif details.type == "dictionary" and details.hasElement("/A"):
                    # verify it to be a link type annotation
                    subtype_elem = details.getElementByName("/Subtype")
                    type_elem = details.getElementByName("/Type")
                    if not subtype_elem or not type_elem:
                        continue
                    subtype_elem = self._get_obj_val(i, subtype_elem)
                    type_elem = self._get_obj_val(i, type_elem)
                    if subtype_elem.getValue() != "/Link" or type_elem.getValue() != "/Annot":
                        continue
                    a_elem = details.getElementByName("/A")
                    a_elem = self._get_obj_val(i, a_elem)
                    if a_elem.type == "dictionary" and a_elem.hasElement("/URI"):
                        uri_elem = a_elem.getElementByName("/URI")
                        uri_elem = self._get_obj_val(i, uri_elem)
                        annoturiset.add(self.base_uri + uri_elem.getValue())
                else:
                    pass

            PDF_result["JSStreams"] = retobjects

        if "creator" in metadata and metadata["creator"]:
            PDF_result["Info"]["Creator"] = to_unicode(metadata["creator"])
        if "producer" in metadata and metadata["producer"]:
            PDF_result["Info"]["Producer"] = to_unicode(metadata["producer"])
        if "author" in metadata and metadata["author"]:
            PDF_result["Info"]["Author"] = to_unicode(metadata["author"])

        if len(urlset):
            PDF_result["JS_URLs"] = list(urlset)
        if len(annoturiset):
            PDF_result["Annot_URLs"] = list(annoturiset)

        statsDict = self.pdf.getStats()
        # Basic info
        basicDict = {}
        basicDict['detection'] = {}
        if statsDict['Detection'] != [] and statsDict['Detection'] is not None:
            basicDict['detection']['rate'] = '%d/%d' % (statsDict['Detection'][0], statsDict['Detection'][1])
            basicDict['detection']['report_link'] = statsDict['Detection report']
        basicDict['pdf_version'] = statsDict['Version']
        basicDict['binary'] = bool(statsDict['Binary'])
        basicDict['linearized'] = bool(statsDict['Linearized'])
        basicDict['encrypted'] = bool(statsDict['Encrypted'])
        basicDict['encryption_algorithms'] = []
        if statsDict['Encryption Algorithms']:
            for algorithmInfo in statsDict['Encryption Algorithms']:
                basicDict['encryption_algorithms'].append({'bits': algorithmInfo[1], 'algorithm': algorithmInfo[0]})
        basicDict['updates'] = int(statsDict['Updates'])
        basicDict['num_objects'] = int(statsDict['Objects'])
        basicDict['num_streams'] = int(statsDict['Streams'])
        basicDict['comments'] = int(statsDict['Comments'])
        basicDict['errors'] = []
        for error in statsDict['Errors']:
            basicDict['errors'].append(error)
        # Advanced info
        advancedInfo = []
        for version in range(len(statsDict['Versions'])):
            statsVersion = statsDict['Versions'][version]
            if version == 0:
                versionType = 'original'
            else:
                versionType = 'update'
            versionInfo = {}
            versionInfo['version_number'] = version
            versionInfo['version_type'] = versionType
            versionInfo['catalog'] = statsVersion['Catalog']
            versionInfo['info'] = statsVersion['Info']
            if statsVersion['Objects'] is not None:
                versionInfo['objects'] = statsVersion['Objects'][1]
            else:
                versionInfo['objects'] = []
            if statsVersion['Compressed Objects'] is not None:
                versionInfo['compressed_objects'] = statsVersion['Compressed Objects'][1]
            else:
                versionInfo['compressed_objects'] = []
            if statsVersion['Errors'] is not None:
                versionInfo['error_objects'] = statsVersion['Errors'][1]
            else:
                versionInfo['error_objects'] = []
            if statsVersion['Streams'] is not None:
                versionInfo['streams'] = statsVersion['Streams'][1]
            else:
                versionInfo['streams'] = []
            if statsVersion['Xref Streams'] is not None:
                versionInfo['xref_streams'] = statsVersion['Xref Streams'][1]
            else:
                versionInfo['xref_streams'] = []
            if statsVersion['Encoded'] is not None:
                versionInfo['encoded_streams'] = statsVersion['Encoded'][1]
            else:
                versionInfo['encoded_streams'] = []
            if versionInfo['encoded_streams'] and statsVersion['Decoding Errors'] is not None:
                versionInfo['decoding_error_streams'] = statsVersion['Decoding Errors'][1]
            else:
                versionInfo['decoding_error_streams'] = []
            if statsVersion['Objects with JS code'] is not None:
                versionInfo['js_objects'] = statsVersion['Objects with JS code'][1]
            else:
                versionInfo['js_objects'] = []
            elements = statsVersion['Elements']
            elementArray = []
            if elements:
                for element in elements:
                    elementInfo = {'name': element}
                    if element in vulnsDict:
                        elementInfo['vuln_name'] = vulnsDict[element][0]
                        elementInfo['vuln_cve_list'] = vulnsDict[element][1]
                    elementInfo['objects'] = elements[element]
                    elementArray.append(elementInfo)
            vulns = statsVersion['Vulns']
            vulnArray = []
            if vulns:
                for vuln in vulns:
                    vulnInfo = {'name': vuln}
                    if vuln in vulnsDict:
                        vulnInfo['vuln_name'] = vulnsDict[vuln][0]
                        vulnInfo['vuln_cve_list'] = vulnsDict[vuln][1]
                    vulnInfo['objects'] = vulns[vuln]
                    vulnArray.append(vulnInfo)
            versionInfo['suspicious_elements'] = {'triggers': statsVersion['Events'],
                                                  'actions': statsVersion['Actions'],
                                                  'elements': elementArray,
                                                  'js_vulns': vulnArray,
                                                  'urls': statsVersion['URLs']}
            versionReport = {'version_info': versionInfo}
            advancedInfo.append(versionReport)
        jsonDict = {
            'basic': basicDict,
            'advanced': advancedInfo
        }
        result["stats"] = jsonDict

        return result

    def run(self):
        """Run analysis.
        @return: analysis results dict or None.
        """
        if not os.path.exists(self.file_path):
            return None
        results = self._parse(self.file_path)
        return results

Remember to import the packages:

import peepdf.PDFCore
import peepdf.JSAnalysis
from peepdf.PDFCore import vulnsDict

Ryuchen commented 7 years ago

I also wrote a thread_timeout method to avoid parsing for too long, but it still doesn't work reliably. Can anyone help review my code?

import time
import wrapt
import ctypes
import threading
from Queue import Queue

def _kill_thread(thread):
    SE = ctypes.py_object(SystemExit)
    tr = ctypes.c_long(thread.ident)
    ctypes.pythonapi.PyThreadState_SetAsyncExc(tr, SE)

class ExecTimeout(BaseException):
    pass

class KilledExecTimeout(ExecTimeout):
    # print (ExecTimeout)
    pass

class FailedKillExecTimeout(ExecTimeout):
    # print (ExecTimeout)
    pass

class NotKillExecTimeout(ExecTimeout):
    # print (ExecTimeout)
    pass

def thread_timeout(delay, kill=True, kill_wait=0.04):
    @wrapt.decorator
    def wrapper(wrapped, instance, args, kwargs):
        queue = Queue()

        def inner_worker():
            result = wrapped(*args, **kwargs)
            queue.put(result)

        thread = threading.Thread(target=inner_worker)
        thread.daemon = True
        thread.start()
        thread.join(delay)
        if thread.isAlive():
            if not kill:
                raise NotKillExecTimeout("Timeout and no kill attempt")
            _kill_thread(thread)
            time.sleep(kill_wait)
            # FIXME: isAlive() is giving false positive results
            if thread.isAlive():
                raise FailedKillExecTimeout("Timeout, thread refuses to die in %s seconds" % kill_wait)
            else:
                raise KilledExecTimeout("Timeout and thread was killed")
        return queue.get()
    return wrapper
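
Two review notes on the kill logic, sketched below with stock ctypes behavior: PyThreadState_SetAsyncExc returns the number of thread states it modified (worth checking, and undoing if it ever exceeds one), and the async exception is only delivered when the target thread next executes Python bytecode. A thread blocked inside a long C-level call (a large regex match in peepdf, say) keeps running until that call returns, which would also explain isAlive() still reporting True right after the kill attempt.

import ctypes

def _kill_thread_checked(thread):
    """Raise SystemExit in `thread`, verifying the call took effect."""
    tid = ctypes.c_long(thread.ident)
    res = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        tid, ctypes.py_object(SystemExit))
    if res == 0:
        raise ValueError("invalid thread id")
    elif res > 1:
        # More than one thread state was touched: undo and bail out.
        ctypes.pythonapi.PyThreadState_SetAsyncExc(tid, None)
        raise SystemError("SetAsyncExc affected %d threads" % res)
    # Even on success the exception lands only at the next bytecode
    # boundary, so a thread stuck in C code dies late or not at all.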

In static.py file:

@thread_timeout(60)
def pdf_worker():
    return PDF(self.file_path).run()

and

elif "PDF" in fileType or self.task["target"].endswith(".pdf"):
    static = pdf_worker()
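
One hedged note on that call site: thread_timeout raises ExecTimeout subclasses on expiry, so the caller presumably has to catch them or the whole processing run aborts. A sketch:

elif "PDF" in fileType or self.task["target"].endswith(".pdf"):
    try:
        static = pdf_worker()
    except ExecTimeout:
        # Parsing blew the 60 s budget; fall back to an empty result
        # instead of killing the rest of the static analysis.
        static = {}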
jgajek commented 7 years ago

@SeanKim777 You are invoking parser.parse() with the manualAnalysis flag set to False. This will automatically do JS emulation. What are the runtimes if you set manualAnalysis to True?

SeanKim777 commented 7 years ago

@jgajek @mallorybobalice Average execution times below for each version with manualAnalysis=True. I have uploaded the PDF file I used here.

Older version 0.3 r235: 22.107 seconds
Newer version 0.3 r275: 46.525 seconds

====== Old version ======

$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) [GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> s = """
... from lib.cuckoo.common.peepdf.PDFCore import PDFParser # Old version
... #from peepdf.PDFCore import PDFParser # newest version
... parser=PDFParser()
... try:
...     ret, pdf = parser.parse("/cuckoo/storage/binaries/1e22b46798975d097444c7cfe45f66b6694cffafb358ad53a020460bfa914174", forceMode=True, looseMode=True, manualAnalysis=True)
... except Exception as e:
...     print str(e)
... """
>>> timeit.Timer(stmt=s).repeat(number=1)
[22.152732849121094, 21.857624053955078, 22.31332015991211]

====== Newer version ======

$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) [GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> s = """
... #from lib.cuckoo.common.peepdf.PDFCore import PDFParser # Old version
... from peepdf.PDFCore import PDFParser # newest version
... parser=PDFParser()
... try:
...     ret, pdf = parser.parse("/cuckoo/storage/binaries/1e22b46798975d097444c7cfe45f66b6694cffafb358ad53a020460bfa914174", forceMode=True, looseMode=True, manualAnalysis=True)
... except Exception as e:
...     print str(e)
... """
>>> timeit.Timer(stmt=s).repeat(number=1)
[50.08627700805664, 44.63256788253784, 44.858914852142334]

jgajek commented 7 years ago

@SeanKim777 Thanks. There is no JavaScript in the document, but it is quite large with a lot of revisions, so it is taking a long time to parse. I'll try to take a look at the parser code when I have some time.

jesparza commented 7 years ago

Hi there!

I just added a new comment in https://github.com/jesparza/peepdf/issues/59 after doing some quick tests and profiling. I hope that helps. In general, using -m and avoiding -l should make things faster, but the analysis might be inaccurate, as -l is used to ignore the endobj tags while parsing; that can be abused by the bad guys as an anti-analysis technique. The -m flag skips emulating some JS code that might otherwise yield a shellcode, for instance, so I guess you have to choose between accuracy and time. Ideally, the emulation could be stopped when it takes too long and the parsing method improved to be faster, but since the document is parsed with regexes I am not sure that will be possible...
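
For the cuckoo side, those flags map onto the PDFParser.parse() keywords already used in this thread: manualAnalysis=True corresponds to -m, and looseMode=False to dropping -l. A faster-but-less-thorough call would therefore look roughly like this (the sample path is a placeholder):

from peepdf.PDFCore import PDFParser

parser = PDFParser()
# Trade accuracy for speed: skip JS emulation (-m) and parse strictly
# (no -l). Note a malformed endobj tag can then derail the parse, which
# is exactly the anti-analysis trick jesparza warns about above.
ret, pdf = parser.parse("/path/to/sample.pdf", forceMode=True,
                        looseMode=False, manualAnalysis=True)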

mallorybobalice commented 7 years ago

Hey guys, @jesparza, @Ryuchen and @spender-sandbox... I had a quick look at the code above from Ryuchen vs. the current static.py PDF class and, to be honest, I quite like what he did. Pity no one raised it earlier? (No PR = lack of attention?)

a) i) The current static.py doesn't do isJavascript checks at all. (Are they implicitly done inside parse as well?)

ii) Both code versions call ret, self.pdf = parser.parse(filepath, forceMode=True, looseMode=True, manualAnalysis=False)

Should this be switched to manualAnalysis=True? (Basically @SeanKim777 was demoing a doc with little JS in it; I'm getting smaller docs with quite a bit of it...)

b) @Ryuchen
Definition missing? objects = [] (after annoturiset)

c) @Ryuchen
pdf -> PDF renaming, including in strings... the latter might not be good, e.g. result["pdf"] -> result["PDF"]. Does that make a difference?

d) @Ryuchen
analyseJS(decoded_stream.strip()) vs analyseJS(decoded_stream), and _clean_string replaced with to_unicode: why? Though I suppose, re: unicode, maybe why not?

e) @Ryuchen

if not peepdf.JSAnalysis.isJavascript(decoded_stream): continue, else jslist, unescapedbytes, urlsfound, errors, ctxdummy = peepdf.JSAnalysis.analyseJS(...)

OK, I'm pretty keen on that (gating the strings on isJavascript). I know @jesparza said it's not perfect, but it's probably better to call it than not at all.

f) statsDict: so this bit is completely new, and you essentially add the static PDF info into the results dict, including a comparison against vulnsDict? I think this is probably how the module should work (so we can, you know... get the PDF info in the PDF info section :)) :+1: Does anyone see any problems with that?

g) @Ryuchen re: py threading - unfortunately I can't comment much. I use interruptingcow, which uses signals and timers plus try/except. At a high level the code looks OK (run, join with timeout, check if alive, kill), but why "isAlive is giving false positive results" happens, and what the correct fix is if a subtle threading pattern is missing - not sure. (I had a look around for higher-level libraries with task primitives that include timeoutable tasks, but no luck.)

h) @ everyone, re: what version we use... it's not exactly a small difference...

Older version 0.3 r235: 22.107 seconds
Newer version 0.3 r275: 46.525 seconds

@jgajek thanks for raising this, and @SeanKim777 thanks for testing.

Should the newer one be merged in, and why - has some default changed, or is it a bug?

TL;DR - can people please peer review @Ryuchen's alteration to the static.py PDF class, and merge it in after confirming the above, perhaps along with changing manualAnalysis to True and the other things(?!)?

Or at least keep the discussion going.

Ryuchen commented 7 years ago

Thanks @mallorybobalice! I had used interruptingcow, but it also didn't work reliably for me. It runs properly in a single process, but when I use multiprocessing it doesn't work. I searched a lot of blog posts for this problem, but found no answer. So I wrote that thread method to kill it. It only works in functions without try/except, so it can't stop peepdf.
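
A likely explanation, though not confirmed in this thread: interruptingcow arms SIGALRM, and CPython only allows installing signal handlers (and only delivers signals) in the main thread of a process, so arming it from a processing worker thread fails. A minimal sketch of the failure mode:

import signal
import threading

def arm_alarm():
    try:
        # This is essentially what interruptingcow does under the hood.
        signal.signal(signal.SIGALRM, lambda signum, frame: None)
        print "SIGALRM handler armed (main thread)"
    except ValueError as e:
        print "cannot arm SIGALRM here:", e

arm_alarm()                              # works: main thread
t = threading.Thread(target=arm_alarm)   # fails with ValueError
t.start()
t.join()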

jgajek commented 7 years ago

@mallorybobalice I'm working on a PR (https://github.com/jgajek/cuckoo-modified), it just needs to cook a little bit more to make sure there are no regressions. I should also probably make the manualAnalysis flag configurable in processing.conf. Should have some time this week to finalize.

Since at least some of the performance problems appear to be attributable to parsing performance, they can only be addressed with an overhaul of the peepdf parser code itself -- I imagine a Flex/Bison implementation would be plenty fast, but I wonder if it's worth the effort if most malicious PDFs are small.
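
Before such an overhaul, a quick profile of a slow sample would confirm where the time actually goes; a sketch using the standard library (the sample path is a placeholder):

import cProfile
import pstats
from peepdf.PDFCore import PDFParser

parser = PDFParser()
cProfile.runctx(
    'parser.parse("/path/to/sample.pdf", forceMode=True, '
    'looseMode=True, manualAnalysis=True)',
    globals(), locals(), "pdf_parse.prof")
# Show the 20 most expensive call sites by cumulative time.
pstats.Stats("pdf_parse.prof").sort_stats("cumulative").print_stats(20)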

Ryuchen commented 7 years ago

Finally I have a way to stop a PDF from blocking the processing, though it cannot fix the parsing itself taking too long. Here is my code.

In /modules/processing/static.py:

class PDF(object):
    """PDF Analysis."""

    def __init__(self, file_path):
        self.file_path = file_path
        self.pdf = None
        self.base_uri = ""

    def _hexencode(self, value):
        ret = ""
        for x in xrange(len(value)):
            if ord(value[x]) > 127:
                tmp = "\\x" + str(value[x].encode("hex"))
            else:
                tmp = value[x]
            ret += tmp
        return ret

    def _get_obj_val(self, version, obj):
        try:
            if obj.type == "reference":
                return self.pdf.body[version].getObject(obj.id)
        except:
            pass
        return obj

    def _set_base_uri(self, version):
        self.base_uri = ""
        try:
            trailer, streamTrailer = self.pdf.trailer[version]
            if trailer is not None:
                elem = trailer.dict.getElementByName("/Root")
                elem = self._get_obj_val(version, elem)
                elem = elem.getElementByName("/URI")
                elem = self._get_obj_val(version, elem)
                elem = elem.getElementByName("/Base")
                elem = self._get_obj_val(version, elem)
                self.base_uri = elem.getValue()
        except:
            pass

    def _parse(self, filepath):
        """Parses the PDF for static information. Uses PyV8 from peepdf to
        extract JavaScript from PDF objects.
        @param filepath: Path to file to be analyzed.
        @return: results dict or None.
        """
        # Load the PDF with PDFiD and convert it to JSON for processing
        PDF_data = PDFiD(filepath, False, True)
        PDF_json = PDFiD2JSON(PDF_data, True)
        PDFid_data = json.loads(PDF_json)[0]

        info = {
            "PDF Header": PDFid_data['pdfid']['header'],
            "Total Entropy": PDFid_data['pdfid']['totalEntropy'],
            'Entropy In Streams': PDFid_data['pdfid']['streamEntropy'],
            'Entropy Out Streams': PDFid_data['pdfid']['nonStreamEntropy'],
            'Count %% EOF': PDFid_data['pdfid']['countEof'],
            'Data After EOF': PDFid_data['pdfid']['countChatAfterLastEof']
        }

        # Note, PDFiD doesn't interpret some dates properly, specifically it doesn't
        # seem to be able to properly represent time zones that involve fractions of
        # an hour
        Dates = PDFid_data['pdfid']['dates']['date']

        keywords = {}
        for keyword in PDFid_data['pdfid']['keywords']['keyword']:
            keywords[str(keyword['name'])] = keyword['count']

        result = {}
        PDF_result = result["PDF"] = {}
        PDF_result["Info"] = info
        PDF_result["Dates"] = Dates
        PDF_result["Keywords"] = keywords

        PDF_parser = peepdf.PDFCore.PDFParser()

        ret, self.pdf = PDF_parser.parse(filepath, forceMode=True, looseMode=True, manualAnalysis=False)

        urlset = set()
        annoturiset = set()
        retobjects = []
        metadata = dict()

        jsPerBody = self.pdf.getJavascriptCode(perObject=True)
        for version, jsPerObject in enumerate(jsPerBody):
            metadata.update(self.pdf.getBasicMetadata(version))
            for oid, js in jsPerObject:
                obj = self.pdf.getObject(oid, version, True)
                if obj:
                    obj_data = {}
                    obj_data["Object ID"] = oid
                    obj_data["Version"] = version
                    obj_data["Offset"] = obj.getOffset()
                    obj_data["Size"] = obj.getSize()
                    obj_data["Type"] = obj.getObject().getType()
                    obj_data["Errors"] = obj.getObject().getErrors()
                    obj_data["Data"] = self._hexencode(js)
                    retobjects.append(obj_data)

        for url in self.pdf.getURLs():
            urlset.add(url)

        for version, uriList in enumerate(self.pdf.getURIs()):
            self._set_base_uri(version)
            for uri in uriList:
                annoturiset.add(self.base_uri + uri)

        PDF_result["JSStreams"] = retobjects

        if "creator" in metadata and metadata["creator"]:
            PDF_result["Info"]["Creator"] = to_unicode(metadata["creator"])
        if "producer" in metadata and metadata["producer"]:
            PDF_result["Info"]["Producer"] = to_unicode(metadata["producer"])
        if "author" in metadata and metadata["author"]:
            PDF_result["Info"]["Author"] = to_unicode(metadata["author"])

        if len(urlset):
            PDF_result["JS_URLs"] = list(urlset)
        if len(annoturiset):
            PDF_result["Annot_URLs"] = list(annoturiset)

        statsDict = self.pdf.getStats()
        # Basic info
        basicDict = {}
        basicDict['detection'] = {}
        if statsDict['Detection'] != [] and statsDict['Detection'] is not None:
            basicDict['detection']['rate'] = '%d/%d' % (statsDict['Detection'][0], statsDict['Detection'][1])
            basicDict['detection']['report_link'] = statsDict['Detection report']
        basicDict['pdf_version'] = statsDict['Version']
        basicDict['binary'] = bool(statsDict['Binary'])
        basicDict['linearized'] = bool(statsDict['Linearized'])
        basicDict['encrypted'] = bool(statsDict['Encrypted'])
        basicDict['encryption_algorithms'] = []
        if statsDict['Encryption Algorithms']:
            for algorithmInfo in statsDict['Encryption Algorithms']:
                basicDict['encryption_algorithms'].append({'bits': algorithmInfo[1], 'algorithm': algorithmInfo[0]})
        basicDict['updates'] = int(statsDict['Updates'])
        basicDict['num_objects'] = int(statsDict['Objects'])
        basicDict['num_streams'] = int(statsDict['Streams'])
        basicDict['comments'] = int(statsDict['Comments'])
        basicDict['errors'] = []
        for error in statsDict['Errors']:
            basicDict['errors'].append(error)
        # Advanced info
        advancedInfo = []
        for version in range(len(statsDict['Versions'])):
            statsVersion = statsDict['Versions'][version]
            if version == 0:
                versionType = 'original'
            else:
                versionType = 'update'
            versionInfo = {}
            versionInfo['version_number'] = version
            versionInfo['version_type'] = versionType
            versionInfo['catalog'] = statsVersion['Catalog']
            versionInfo['info'] = statsVersion['Info']
            if statsVersion['Objects'] is not None:
                versionInfo['objects'] = statsVersion['Objects'][1]
            else:
                versionInfo['objects'] = []
            if statsVersion['Compressed Objects'] is not None:
                versionInfo['compressed_objects'] = statsVersion['Compressed Objects'][1]
            else:
                versionInfo['compressed_objects'] = []
            if statsVersion['Errors'] is not None:
                versionInfo['error_objects'] = statsVersion['Errors'][1]
            else:
                versionInfo['error_objects'] = []
            if statsVersion['Streams'] is not None:
                versionInfo['streams'] = statsVersion['Streams'][1]
            else:
                versionInfo['streams'] = []
            if statsVersion['Xref Streams'] is not None:
                versionInfo['xref_streams'] = statsVersion['Xref Streams'][1]
            else:
                versionInfo['xref_streams'] = []
            if statsVersion['Encoded'] is not None:
                versionInfo['encoded_streams'] = statsVersion['Encoded'][1]
            else:
                versionInfo['encoded_streams'] = []
            if versionInfo['encoded_streams'] and statsVersion['Decoding Errors'] is not None:
                versionInfo['decoding_error_streams'] = statsVersion['Decoding Errors'][1]
            else:
                versionInfo['decoding_error_streams'] = []
            if statsVersion['Objects with JS code'] is not None:
                versionInfo['js_objects'] = statsVersion['Objects with JS code'][1]
            else:
                versionInfo['js_objects'] = []
            elements = statsVersion['Elements']
            elementArray = []
            if elements:
                for element in elements:
                    elementInfo = {'name': element}
                    if element in vulnsDict:
                        elementInfo['vuln_name'] = vulnsDict[element][0]
                        elementInfo['vuln_cve_list'] = vulnsDict[element][1]
                    elementInfo['objects'] = elements[element]
                    elementArray.append(elementInfo)
            vulns = statsVersion['Vulns']
            vulnArray = []
            if vulns:
                for vuln in vulns:
                    vulnInfo = {'name': vuln}
                    if vuln in vulnsDict:
                        vulnInfo['vuln_name'] = vulnsDict[vuln][0]
                        vulnInfo['vuln_cve_list'] = vulnsDict[vuln][1]
                    vulnInfo['objects'] = vulns[vuln]
                    vulnArray.append(vulnInfo)
            versionInfo['suspicious_elements'] = {'triggers': statsVersion['Events'],
                                                  'actions': statsVersion['Actions'],
                                                  'elements': elementArray,
                                                  'js_vulns': vulnArray,
                                                  'urls': statsVersion['URLs']}
            versionReport = {'version_info': versionInfo}
            advancedInfo.append(versionReport)
        jsonDict = {
                'basic': basicDict,
                'advanced': advancedInfo
            }
        result["stats"] = jsonDict

        return result

    def run(self):
        """Run analysis.
        @return: analysis results dict or None.
        """
        if not os.path.exists(self.file_path):
            return None
        try:
            results = self._parse(self.file_path)
        except:
            results = dict()
        return results

Thanks to @jgajek - I read your code and used it to tidy up mine.

In /utils/process.py

from interruptingcow import timeout

...

def process(target=None, copy_path=None, task=None, report=False, auto=False):
    try:
        with timeout(360, exception=RuntimeError):
            # This is the results container. It's what will be used by all the
            # reporting modules to make it consumable by humans and machines.
            # It will contain all the results generated by every processing
            # module available. Its structure can be observed through the JSON
            # dump in the analysis' reports folder. (If jsondump is enabled.)
            results = {"statistics": {}}
            results["statistics"]["processing"] = list()
            results["statistics"]["signatures"] = list()
            results["statistics"]["reporting"] = list()
            GetFeeds(results=results).run()
            RunProcessing(task=task, results=results).run()
            RunSignatures(task=task, results=results).run()

            task_id = task["id"]
            if report:
                repconf = Config("reporting")
                host = repconf.mongodb.host
                port = repconf.mongodb.port
                db = repconf.mongodb.db
                conn = MongoClient(host, port)
                mdata = conn[db]
                analyses = mdata.analysis.find({"info.id": int(task_id)})
                if analyses.count() > 0:
                    log.debug("Deleting analysis data for Task %s" % task_id)
                    for analysis in analyses:
                        for process in analysis["behavior"]["processes"]:
                            for call in process["calls"]:
                                mdata.calls.remove({"_id": ObjectId(call)})
                        mdata.analysis.remove({"_id": ObjectId(analysis["_id"])})
                conn.close()
                log.debug("Deleted previous MongoDB data for Task %s" % task_id)

                RunReporting(task=task, results=results).run()
                Database().set_status(task_id, TASK_REPORTED)

                if auto:
                    if cfg.hawkeye.delete_original and os.path.exists(target):
                        os.unlink(target)

                    if cfg.hawkeye.delete_bin_copy and copy_path \
                            and os.path.exists(copy_path) and (results["malscore"] <= cfg.safereport.hawkeyescore):
                        os.unlink(copy_path)
    except RuntimeError:
        log.warning("Does not finish running processing in 6 min!")

And this escapes the blocked PDF parsing!