Open So-Cool opened 8 years ago
The very basic implementation of the above features is complete. The load_features method placed in the ML class (modules/cuckooml/cuckooml.py file) requires some enhancements though. All of them are explained in the comments and marked with a TODO flag.
Hi So-Cool,
I read your blog post on this issue: "The problems I’m aware of are Windows API calls and filesystem operations. I could find overview of API calls in behavior->apistats but the paper mentions that the exact sequence can be extracted form the “raw Cuckoo sandbox output”. Is it located in strings in the JSONs? Any ideas how to extract it?" -- http://honeynet.github.io/cuckooml/2016/06/19/static-features/
The dataset that you distributed in another blog post does not contain the API call sequences, I'm not sure why that is. But if you run a sample through Cuckoo you will get access to the calls a process makes in the following way:
processes = self.report.get("behavior", {}).get("processes", [])
for p in processes:
    # Each process entry carries its API calls in order.
    apicalls = p.get("calls", [])
    for a in apicalls:
        api = a.get("api")
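To make the loop above easier to reuse outside the class, here is a minimal standalone sketch. It assumes the report is already loaded as a Python dict with the behavior -> processes -> calls -> api layout shown above; the function name extract_api_sequences and the sample report are hypothetical and not part of CuckooML.

```python
def extract_api_sequences(report):
    """Return one ordered list of API names per process in the report."""
    sequences = []
    for process in report.get("behavior", {}).get("processes", []):
        # Keep the calls in the order they appear; skip entries without a name.
        names = [call["api"] for call in process.get("calls", []) if "api" in call]
        sequences.append(names)
    return sequences


# Hypothetical, heavily trimmed report used only for illustration.
sample_report = {
    "behavior": {
        "processes": [
            {"calls": [{"api": "NtOpenFile"}, {"api": "NtClose"}]},
            {"calls": []},
        ]
    }
}
sequences = extract_api_sequences(sample_report)
```

Returning one list per process keeps the call order intact, which matters if the sequences are later split into n-grams.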
I'm currently attempting to implement the ideas on API call sequences expressed in the paper mentioned above. I would be glad to discuss approaches for feature construction (how to represent an API call and a sequence of three) and how to vectorize it. Are you still working on CuckooML?
@dueland you might want to check scikit-learn CountVectorizer
@hgascon thanks for the link. I propose the following:
Vectorize the hash with n-grams for n=3, using the approach taken in CuckooML:
for ngram_api in self.__handle_ssdeep(str(features[i]["api_seq"])):
    my_features[i][":simp:impssdeep:" + ngram_api] = 1
Can you spot any shortcomings with that approach? And is this approach using what is known as the hashing trick?
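For reference, the hashing trick usually means mapping each feature (here, each API 3-gram) to an index in a fixed-size vector through a hash function, so no vocabulary needs to be stored. The following is a rough sketch of that idea only, not of what CuckooML does; the names api_ngrams and hash_features, the bucket count, and the sample call list are all assumptions made for illustration.

```python
import hashlib


def api_ngrams(api_seq, n=3):
    """Return all contiguous n-grams of an API call sequence as tuples."""
    return [tuple(api_seq[i:i + n]) for i in range(len(api_seq) - n + 1)]


def hash_features(api_seq, n=3, n_buckets=1024):
    """Count n-grams in a fixed-size vector indexed by a hash of the n-gram."""
    vector = [0] * n_buckets
    for gram in api_ngrams(api_seq, n):
        key = "|".join(gram).encode("utf-8")
        # md5 is used only as a stable hash, so indices are the same on every run.
        index = int(hashlib.md5(key).hexdigest(), 16) % n_buckets
        vector[index] += 1
    return vector


# Hypothetical call sequence for a single process.
calls = ["NtOpenFile", "NtReadFile", "NtClose",
         "NtOpenFile", "NtReadFile", "NtClose"]
vec = hash_features(calls)
```

The trade-off is that distinct 3-grams can collide in the same bucket, and the vector no longer tells you which 3-gram produced a given count.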
UPDATE:
Sounds good. I also had an idea to build a transition network with weights representing the number of transitions of a given type seen so far, but it's probably a bit more complicated.
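One possible reading of the transition network idea, sketched under the assumption that a "transition" is one API call directly followed by another and that the weights are plain counts. The function names and the sample sequence are hypothetical.

```python
from collections import Counter


def transition_counts(api_seq):
    """Count how often each API call is directly followed by each other call."""
    return Counter(zip(api_seq, api_seq[1:]))


def update_network(network, api_seq):
    """Add the transitions of one more sequence to a running edge-weight table."""
    network.update(transition_counts(api_seq))
    return network


# Hypothetical call sequence for a single process.
calls = ["NtOpenFile", "NtReadFile", "NtClose",
         "NtOpenFile", "NtReadFile", "NtClose"]
network = update_network(Counter(), calls)
```

Each key of the counter is a directed edge (from_api, to_api) and the value is its weight, so sequences from further samples can be folded in with repeated update_network calls.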
The base of ML features for binaries analysed by Cuckoo is going to be inspired by "Reviewer Integration and Performance Measurement for Malware Detection" by B. Miller et al. (available here).
They name all kinds of binary features, both static and dynamic, which seem a good starting point for this project:
Once implemented they should be reviewed and revised with regard to usability for this project.