Useful malware features

So-Cool commented 8 years ago

The set of ML features for binaries analysed by Cuckoo will be inspired by "Reviewer Integration and Performance Measurement for Malware Detection" by B. Miller et al. (available here).
They name many kinds of binary features, both static and dynamic, which seem a good starting point for this project:

static attributes:
- binary metadata,
- digital signing,
- heuristic tools,
- packer detection,
- portable executable format,
- static imports;
dynamic attributes:
- dynamic imports, mutexes, processes,
- filesystem operations,
- network operations,
- registry operations,
- Windows API calls.

Once implemented they should be reviewed and revised with regard to usability for this project.

So-Cool commented 8 years ago

A very basic implementation of the above features is complete. The load_features method in the ML class (modules/cuckooml/cuckooml.py) still requires some enhancements, though. All of them are explained in the comments and marked with a TODO flag.

ghost commented 8 years ago

Hi So-Cool,

I read your blog post on this issue: "The problems I’m aware of are Windows API calls and filesystem operations. I could find overview of API calls in behavior->apistats but the paper mentions that the exact sequence can be extracted form the “raw Cuckoo sandbox output”. Is it located in strings in the JSONs? Any ideas how to extract it?" -- http://honeynet.github.io/cuckooml/2016/06/19/static-features/

The dataset that you distributed in another blog post does not contain the API call sequences; I'm not sure why that is. But if you run a sample through Cuckoo, you can access the calls each process makes in the following way:

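    # Walk every process in the Cuckoo report and read its API calls in order.
    processes = self.report.get("behavior", {}).get("processes", [])
    for p in processes:
        apicalls = p.get("calls", [])
        for a in apicalls:
            api = a.get("api")  # name of the API function that was called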
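
The loop above can be wrapped into a small standalone helper. This is only a sketch: it assumes `report` is the parsed JSON report as a plain dict, and the function name `api_sequences` and the sample fragment are made up for illustration.

```python
def api_sequences(report):
    """Return one ordered list of API names per process in the report."""
    sequences = []
    for process in report.get("behavior", {}).get("processes", []):
        # Each entry in "calls" is a dict; "api" holds the function name.
        names = [call["api"] for call in process.get("calls", []) if "api" in call]
        sequences.append(names)
    return sequences


# Hand-written fragment for illustration only, not real Cuckoo output.
report = {
    "behavior": {
        "processes": [
            {"calls": [{"api": "NtOpenFile"}, {"api": "NtReadFile"}, {"api": "NtClose"}]},
            {"calls": []},
        ]
    }
}
sequences = api_sequences(report)
```

Keeping one list per process preserves the call order within each process, which is what the sequence features need.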

I'm currently attempting to implement the ideas on API call sequences expressed in the paper mentioned above, and I would be glad to discuss approaches to feature construction (how to represent an API call and a sequence of three) and how to vectorize the result. Are you still working on CuckooML?

hgascon commented 8 years ago

@dueland you might want to check scikit-learn CountVectorizer

ghost commented 8 years ago

@hgascon thanks for the link. I propose the following:

- build a string representation of three consecutive API calls,
- hash it using hashlib md5(),
- vectorize the hash with n-grams for n=3, with the approach taken in CuckooML:

    for ngram_api in self.__handle_ssdeep(str(features[i]["api_seq"])):
        my_features[i][":simp:impssdeep:" + ngram_api] = 1

Can you spot any shortcomings with that approach? And is this approach using what is known as the hashing trick?
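
The first two steps of this proposal could look roughly like the sketch below. The helper name `chunked_triple_hashes`, the space separator, the sample API names, and the choice to drop an incomplete trailing chunk are all assumptions made for illustration; the n-gram step that relies on CuckooML's own helper is left out.

```python
import hashlib


def chunked_triple_hashes(calls, n=3):
    """Hash each non-overlapping run of n API calls with MD5."""
    hashes = []
    # Step through the calls n at a time; an incomplete tail is dropped (assumption).
    for i in range(0, len(calls) - n + 1, n):
        text = " ".join(calls[i:i + n])
        hashes.append(hashlib.md5(text.encode("utf-8")).hexdigest())
    return hashes


# Illustrative call sequence, not taken from a real report.
calls = ["NtOpenFile", "NtReadFile", "NtClose",
         "NtOpenKey", "NtQueryValueKey", "NtClose",
         "NtCreateFile"]
# Presence-only features: each hash is simply marked as seen.
features = {h: 1 for h in chunked_triple_hashes(calls)}
```

Note that stepping in chunks of three makes the resulting triples depend on where the sequence happens to start.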

UPDATE:

- My supervisor discouraged using n-grams at all; instead, he suggested using the hash as the index.
- Instead of simply marking the hash as present (hash = 1), we are considering counting the occurrences of the hash (hash += 1).
- Instead of iterating through the list of API calls in steps of 3 to retrieve a sequence of API calls, an idea is to iterate one step at a time and then construct a sequence of adjacent elements as before. The difference is that it would not be arbitrary which sequences arise. An example for [a, b, c, d, e, f]:
  - approach 1: abc, def
  - approach 2: abc, bcd, cde, def
- Use the non-cryptographic 32-bit xxhash instead of md5.
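
A rough sketch of the updated idea: slide over the calls one step at a time, hash each triple, and count occurrences with the hash as the index. To stay within the standard library it uses `zlib.crc32` as a stand-in for xxhash, which is a deliberate substitution. The function names, the `|` separator, and the bucket count are also assumptions. Reducing the hash modulo a fixed number of buckets is what is usually called the hashing trick.

```python
import zlib
from collections import Counter


def sliding_triples(calls, n=3):
    """Yield every run of n adjacent API calls, moving one step at a time."""
    for i in range(len(calls) - n + 1):
        yield tuple(calls[i:i + n])


def hashed_counts(calls, n=3, n_buckets=2 ** 20):
    """Count hashed n-call sequences; the bucket index is the feature index."""
    counts = Counter()
    for triple in sliding_triples(calls, n):
        key = "|".join(triple).encode("utf-8")
        # crc32 is a stdlib stand-in for the 32-bit xxhash mentioned above.
        index = zlib.crc32(key) % n_buckets
        counts[index] += 1  # count occurrences instead of marking presence
    return counts


calls = ["a", "b", "c", "d", "e", "f"]
triples = list(sliding_triples(calls))
# Appending a, b, c again makes the triple (a, b, c) occur twice.
features = hashed_counts(calls + calls[:3])
```

Different triples can collide in the same bucket; a larger `n_buckets` makes that less likely at the cost of a wider feature vector.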

So-Cool commented 8 years ago

Sounds good. I also had an idea to build a transition network with weights representing the number of transitions of a given type seen so far, but it's probably a bit more complicated.
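
One possible reading of the transition network idea, as a minimal sketch: treat each API call as a node and weight the edge from one call to the next by how often that transition has been seen. The function name `transition_counts` and the sample sequence are assumptions made for illustration.

```python
from collections import Counter


def transition_counts(calls):
    """Count how often each API call is directly followed by another."""
    # zip pairs every call with its successor: (c0, c1), (c1, c2), ...
    return Counter(zip(calls, calls[1:]))


# Illustrative call sequence, not taken from a real report.
calls = ["NtOpenFile", "NtReadFile", "NtReadFile", "NtClose"]
weights = transition_counts(calls)
```

The resulting counter maps (from, to) pairs to edge weights and can be updated incrementally as more sequences are processed, for example with `weights.update(transition_counts(other_calls))`.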

honeynet / cuckooml

Useful malware features #5