Apache License 2.0

malware-prediction-rnn

RNN implementation in Keras to predict malware from machine activity data - code for experiments in Early Stage Malware Prediction Using Recurrent Neural Networks

https://github.com/mprhode/malware-prediction-rnn/assets/16882132/9ed1d25a-44be-4c49-b7cb-174d7c48decf

(The percentage certainty shown above is for demo purposes only and should not be taken as an indicator of model reliability!)

Data (data_2.csv) is available at http://doi.org/10.17035/d.2018.0050524986

The experiments from the paper are set out in order in run_experiments

Implementation uses Keras v2.0.6 and Python >= 3.4

If you use our code in your research please cite:

@article{RHODE2018578,
title = "Early-stage malware prediction using recurrent neural networks",
journal = "Computers & Security",
volume = "77",
pages = "578 - 594",
year = "2018",
issn = "0167-4048",
doi = "https://doi.org/10.1016/j.cose.2018.05.010",
url = "http://www.sciencedirect.com/science/article/pii/S0167404818305546",
author = "Matilda Rhode and Pete Burnap and Kevin Jones",
}

Experiment

The basic Experiment class in experiments > Experiments takes a dictionary of hyperparameters and the data as objects (either as a tuple for k-fold cross-validation, or as four separate items: train/test inputs and labels). Results are stored in a folder as a comma-separated values (CSV) file.
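As a rough illustration of the k-fold path (not the repository's own API), splitting sample indices into k roughly equal folds could look like this:

```python
def k_fold_indices(n_samples, k):
    """Split sample indices into k roughly equal folds (illustrative only)."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `remainder` folds get one extra sample each
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    return folds

folds = k_fold_indices(10, 3)
print([len(f) for f in folds])  # [4, 3, 3]
```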

Increase_Snaphot_Experiment

Increases the temporal distance between input features. Add "steps" to the parameters dictionary to increase the time interval between data snapshots; the value should be an integer <= sequence_length
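As a hypothetical illustration of the "steps" parameter (the exact mechanics in the repository may differ): with a machine-activity sequence of consecutive snapshots, a step of k keeps every k-th snapshot, widening the time interval between the rows the RNN sees.

```python
# Ten consecutive machine-activity snapshots (stand-in values)
sequence = list(range(10))
steps = 3                  # assumed interval between retained snapshots
spaced = sequence[::steps]
print(spaced)  # [0, 3, 6, 9]
```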

Ensemble_configurations

Average the results of multiple RNN models. The experiment will only search the sequence_length space, and will take the first value provided for every other hyperparameter if more than one is supplied

Ensemble_sub_sequences

Average the results of classifying all sub-sequences and the entire data sequence. The experiment will only search the sequence_length space, and will take the first value provided for every other hyperparameter if more than one is supplied
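Both ensemble experiments average per-model (or per-sub-sequence) outputs. A minimal sketch of the averaging step, under the assumption that each classifier emits a probability that the sample is malicious and the mean is thresholded at 0.5:

```python
# Hypothetical per-classifier outputs (probability of malware)
predictions = [0.9, 0.6, 0.4]

# Ensemble score: unweighted mean of the individual probabilities
mean_score = sum(predictions) / len(predictions)
is_malware = mean_score >= 0.5
```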

Omit_test_data

Leave all possible combinations of input features out of training to see the impact of their omission. Trains a model, then sequentially omits all possible combinations of 1, 2, 3, ..., n features, where n is the total number of features, giving 2047 combinations for the 11 features used in the paper.
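The 2047 figure is the number of non-empty subsets of 11 features (2^11 - 1). A quick sketch of enumerating those omission sets with the standard library:

```python
from itertools import combinations

features = list(range(11))  # 11 input features, as in the paper

# Every non-empty subset of features to omit: sizes 1..11
omission_sets = [c for size in range(1, len(features) + 1)
                 for c in combinations(features, size)]
print(len(omission_sets))  # 2047 = 2**11 - 1
```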

Omit training data

Leave a single feature out of testing and training.

RNN implementation

Takes a dictionary of parameters, the training data, and the testing data as input. The data are used to determine the shape of the RNN layers. The possible configuration options are outlined in Configurations / RNN hyperparameters.

Configurations / RNN hyperparameters

The configuration dictionaries used in the paper are stored in experiments > Configs. The parameters which can be edited and passed into an experiment are detailed in the table below. N.B. these ranges are wider than the limitations of the random-search configuration; see the commented code for details of each hyperparameter.

| Parameter | Possible values | Notes |
| --- | --- | --- |
| "layer_type" | "GRU", "LSTM" | fixed as "GRU" in Configs |
| "loss" | "binary_crossentropy" | - |
| "kernel_initializer" | "lecun_uniform" | can also be any of the initialisers listed in Keras |
| "recurrent_initializer" | "lecun_uniform" | can also be any of the initialisers listed in Keras |
| "activation" | "sigmoid" | - |
| "depth" | integer >= 1 | - |
| "bidirectional" | Boolean | - |
| "hidden_neurons" | integer >= 1 | - |
| "learning_rate" | 0 <= float <= 1 | defaults to 0.001 if the "adam" optimiser is used |
| "optimiser" | "adam", "sgd" | - |
| "dropout" | 0 <= float < 1 | - |
| "b_l1_reg" | 0 <= float < 1 | - |
| "b_l2_reg" | 0 <= float < 1 | - |
| "r_l1_reg" | 0 <= float < 1 | - |
| "r_l2_reg" | 0 <= float < 1 | - |
| "epochs" | integer > 1 | - |
| "sequence_length" | 1 < integer < 300 | - |
| "batch_size" | 1 < integer < 59 | - |
| "description" | string describing the parameters | only needed for Ensemble_configurations |
| "step" | integer >= 1 | only needed for Increase_Snaphot_Experiment |
| "leave_out_feature" | 0 <= integer < number of input features (here 11) | not necessary for the code to work |

Formatting hyperparameter configurations

Hyperparameters should be supplied as a dictionary with the parameter name as the key and the value(s) stored either in a list or as the keys of a dictionary. If using dictionaries, also supply relative weights representing the frequency with which each value should be chosen in a random search (the weights are ignored in a grid search). Lists and dictionaries can be mixed together, e.g.:

{
    # more parameters up here
    "dropout": [0, 0.1, 0.2, 0.3],
    "optimiser": {"adam": 0.75, "sgd": 0.25},  # equivalent to {"adam": 3, "sgd": 1} as the weights are relative
    "epochs": list(range(1, 1000)),
    # more parameters down here
}
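To make the list-vs-dictionary convention concrete, here is a sketch of how a single random-search draw over such a mixed dictionary could work. This is an illustration only, not the repository's own sampler; `sample_configuration` is a hypothetical helper.

```python
import random

# Mixed hyperparameter space: lists give uniform choices,
# dicts give relative weights for random search
search_space = {
    "dropout": [0, 0.1, 0.2, 0.3],
    "optimiser": {"adam": 0.75, "sgd": 0.25},
}

def sample_configuration(space, rng=random):
    """Draw one configuration: uniform over lists, weighted over dicts."""
    config = {}
    for name, values in space.items():
        if isinstance(values, dict):
            # Keys are the candidate values; dict values are relative weights
            choices, weights = zip(*values.items())
            config[name] = rng.choices(choices, weights=weights, k=1)[0]
        else:
            config[name] = rng.choice(values)
    return config

config = sample_configuration(search_space)
```

In a grid search the weights would simply be ignored and every key of the dictionary enumerated, which is why the two container types can be mixed freely.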