honeynet / cuckooml

CuckooML: Machine Learning for Cuckoo Sandbox
https://honeynet.github.io/cuckooml/

Only one set of features is used for clustering #14

Closed (ghost closed 7 years ago)

ghost commented 7 years ago

Hi,

thanks for sharing this project! I am adding features to the nominal feature set, and I noticed that my changes were not reflected in the clustering results, even though I specified nominal in the configuration. I believe the reason is that the code handling the configuration settings uses an if ... elif construct, so only the first matching set of features is ever selected. The relevant code snippet is:

    # Select features
    selected_features = []
    sf = [i.strip() for i in cfg.cuckooml.features.split(",")]
    if "simple" in sf:
        selected_features.append(simple_features)
    elif "nominal" in sf:
        selected_features.append(features_nominal)
    elif "numerical" in sf:
        selected_features.append(features_numerical)
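
For comparison, here is a minimal sketch of the two behaviours side by side. It is not the project's code: the function names and the placeholder strings standing in for the feature tables are assumptions made for illustration, and it is written for Python 3.

```python
def select_with_elif(configured, available):
    """Mirror the reported behaviour: the chain stops at the first match."""
    selected = []
    if "simple" in configured:
        selected.append(available["simple"])
    elif "nominal" in configured:
        selected.append(available["nominal"])
    elif "numerical" in configured:
        selected.append(available["numerical"])
    return selected


def select_with_ifs(configured, available):
    """Test each name independently so every requested set is appended."""
    selected = []
    if "simple" in configured:
        selected.append(available["simple"])
    if "nominal" in configured:
        selected.append(available["nominal"])
    if "numerical" in configured:
        selected.append(available["numerical"])
    return selected


# Hypothetical stand-ins for the real feature tables.
available = {"simple": "SIMPLE", "nominal": "NOMINAL", "numerical": "NUMERICAL"}
configured = [i.strip() for i in "simple, nominal".split(",")]

only_first = select_with_elif(configured, available)
all_requested = select_with_ifs(configured, available)
```

With "simple, nominal" configured, the elif version returns only the simple set, while the version with independent if statements returns both.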

ghost commented 7 years ago

I still encountered the issue after changing the conditionals to reflect the intention of the author (appending). The reason has to do with the ensuing for loop:

    # Apply filters to selected datasets
    filters = [i.strip() for i in cfg.cuckooml.features_filter.split(",")]
    data = []
    for f, d in itertools.izip(filters, selected_features):
        if f == "log_bin":
            data.append(d.applymap(ml.__log_bin))
        elif f == "filter_dataset":  # D: Only runs once!
            print "RUNs\n"
            data.append(ml.filter_dataset(d))

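If you only specified "filter_dataset" once in the configuration, this for loop will only run once: itertools.izip stops as soon as its shortest input is exhausted, so any feature sets beyond the first are silently skipped. To get around that, add one filter entry per feature set in the configuration file: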

  features_filter = filter_dataset, filter_dataset
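
The single-iteration behaviour described above can be reproduced in isolation. The sketch below uses Python 3's built-in zip, which truncates at the shortest input just as itertools.izip does in Python 2. The helper pair_filters is hypothetical and not part of CuckooML; it only shows one way such a mismatch could be surfaced instead of silently dropped.

```python
def pair_filters(filters, feature_sets):
    """Pair each filter with a feature set, refusing mismatched lengths.

    Hypothetical helper for illustration; not part of CuckooML.
    """
    if len(filters) != len(feature_sets):
        raise ValueError(
            "expected one filter per feature set, got %d filters for %d sets"
            % (len(filters), len(feature_sets))
        )
    return list(zip(filters, feature_sets))


# Placeholder names stand in for the selected feature tables.
feature_sets = ["simple", "nominal"]

# One filter, two feature sets: zip stops after the first pair.
truncated = list(zip(["filter_dataset"], feature_sets))

# One filter per feature set: both sets are processed.
paired = pair_filters(["filter_dataset", "filter_dataset"], feature_sets)
```

Here truncated holds a single pair, so the second feature set never reaches the filtering step, while paired covers both.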

So-Cool commented 7 years ago

Hi, well spotted. This would not work if you put more than one set of features in the config file. I have fixed this with 859bec5.

You are right, the configuration file takes matching pairs of feature set and filter, e.g.:

features = simple, nominal
features_filter = filter_dataset, filter_dataset