Apache License 2.0

malware-prediction-rnn

RNN implementation in Keras to predict malware from machine activity data - code for experiments in Early Stage Malware Prediction Using Recurrent Neural Networks

https://github.com/mprhode/malware-prediction-rnn/assets/16882132/9ed1d25a-44be-4c49-b7cb-174d7c48decf

(The percentage certainty shown above is for demo purposes only and should not be taken as an indicator of model reliability!)

Data (data_2.csv) is available at http://doi.org/10.17035/d.2018.0050524986

The experiments from the paper are set out in order in run_experiments

Implementation uses Keras v2.0.6 and Python >= 3.4

If you use our code in your research please cite:

@article{RHODE2018578,
title = "Early-stage malware prediction using recurrent neural networks",
journal = "Computers & Security",
volume = "77",
pages = "578 - 594",
year = "2018",
issn = "0167-4048",
doi = "https://doi.org/10.1016/j.cose.2018.05.010",
url = "http://www.sciencedirect.com/science/article/pii/S0167404818305546",
author = "Matilda Rhode and Pete Burnap and Kevin Jones",
}

Experiment

The basic Experiment class in experiments > Experiments takes a dictionary of hyperparameters and the data as objects (either as a tuple for k-fold cross-validation, or as four separate items: train/test inputs and labels). Results are stored in a folder as a comma-separated values (CSV) file.
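As a rough illustration of the k-fold path (not the repository's own API), splitting sample indices into k roughly equal folds could look like this:

```python
def k_fold_indices(n_samples, k):
    """Split sample indices into k roughly equal folds (illustrative only)."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `remainder` folds get one extra sample each
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    return folds

folds = k_fold_indices(10, 3)
print([len(f) for f in folds])  # [4, 3, 3]
```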

Increase_Snaphot_Experiment

Increases the temporal distance between input features. Add "steps" to the parameters dictionary to increase the time interval between data snapshots; the value should be an integer <= sequence_length
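As a hypothetical illustration of the "steps" parameter (the exact mechanics in the repository may differ): with a machine-activity sequence of consecutive snapshots, a step of k keeps every k-th snapshot, widening the time interval between the rows the RNN sees.

```python
# Ten consecutive machine-activity snapshots (stand-in values)
sequence = list(range(10))
steps = 3                  # assumed interval between retained snapshots
spaced = sequence[::steps]
print(spaced)  # [0, 3, 6, 9]
```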

Ensemble_configurations

Average the results of multiple RNN models. The experiment will only search the sequence_length space, and will take the first value provided for every other hyperparameter if more than one is supplied

Ensemble_sub_sequences

Average the results of classifying all sub-sequences and the entire data sequence. The experiment will only search the sequence_length space, and will take the first value provided for every other hyperparameter if more than one is supplied
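Both ensemble experiments average per-model (or per-sub-sequence) outputs. A minimal sketch of the averaging step, under the assumption that each classifier emits a probability that the sample is malicious and the mean is thresholded at 0.5:

```python
# Hypothetical per-classifier outputs (probability of malware)
predictions = [0.9, 0.6, 0.4]

# Ensemble score: unweighted mean of the individual probabilities
mean_score = sum(predictions) / len(predictions)
is_malware = mean_score >= 0.5
```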

Omit_test_data

Leave all possible combinations of input features out of training to see the impact of their omission. Trains a model, then sequentially omits all possible combinations of 1, 2, 3, ..., n features, where n is the total number of features, giving 2047 combinations for the 11 features used in the paper.
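The 2047 figure is the number of non-empty subsets of 11 features (2^11 - 1). A quick sketch of enumerating those omission sets with the standard library:

```python
from itertools import combinations

features = list(range(11))  # 11 input features, as in the paper

# Every non-empty subset of features to omit: sizes 1..11
omission_sets = [c for size in range(1, len(features) + 1)
                 for c in combinations(features, size)]
print(len(omission_sets))  # 2047 = 2**11 - 1
```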

Omit training data

Leave a single feature out of testing and training.

RNN implementation

Takes a dictionary of parameters, the training data, and the testing data as input. The data are used to determine the shape of the RNN layers. The possible configuration options are outlined in Configurations / RNN hyperparameters.

Configurations / RNN hyperparameters

The configuration dictionaries used in the paper are stored in experiments > Configs. The parameters which can be edited and passed into an experiment are detailed in the table below. N.B. these ranges are wider than the limitations of the random-search configuration; see the commented code for details of each hyperparameter.

| Parameter | Possible values | Notes |
| --- | --- | --- |
| "layer_type" | "GRU", "LSTM" | fixed as "GRU" in Configs |
| "loss" | "binary_crossentropy" | - |
| "kernel_initializer" | "lecun_uniform" | can also be any of the initialisers listed in Keras |
| "recurrent_initializer" | "lecun_uniform" | can also be any of the initialisers listed in Keras |
| "activation" | "sigmoid" | - |
| "depth" | integer >= 1 | - |
| "bidirectional" | Boolean | - |
| "hidden_neurons" | integer >= 1 | - |
| "learning_rate" | 0 <= float <= 1 | defaults to 0.001 if the "adam" optimiser is used |
| "optimiser" | "adam", "sgd" | - |
| "dropout" | 0 <= float < 1 | - |
| "b_l1_reg" | 0 <= float < 1 | - |
| "b_l2_reg" | 0 <= float < 1 | - |
| "r_l1_reg" | 0 <= float < 1 | - |
| "r_l2_reg" | 0 <= float < 1 | - |
| "epochs" | integer > 1 | - |
| "sequence_length" | 1 < integer < 300 | - |
| "batch_size" | 1 < integer < 59 | - |
| "description" | string describing the parameters | only needed for Ensemble_configurations |
| "step" | integer >= 1 | only needed for Increase_Snaphot_Experiment |
| "leave_out_feature" | 0 <= integer < number of input features (here 11) | not necessary for the code to work |

Formatting hyperparameter configurations

Hyperparameters should be supplied as a dictionary with the parameter name as the key and the value(s) stored either in a list or as the keys of a dictionary. If using dictionaries, also supply relative weights representing the frequency with which each value should be chosen in a random search (the weights are ignored in a grid search). Lists and dictionaries can be mixed together, e.g.:

{
    # more parameters up here
    "dropout": [0, 0.1, 0.2, 0.3],
    "optimiser": {"adam": 0.75, "sgd": 0.25},  # equivalent to {"adam": 3, "sgd": 1} as the weights are relative
    "epochs": list(range(1, 1000)),
    # more parameters down here
}
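To make the list-vs-dictionary convention concrete, here is a sketch of how a single random-search draw over such a mixed dictionary could work. This is an illustration only, not the repository's own sampler; `sample_configuration` is a hypothetical helper.

```python
import random

# Mixed hyperparameter space: lists give uniform choices,
# dicts give relative weights for random search
search_space = {
    "dropout": [0, 0.1, 0.2, 0.3],
    "optimiser": {"adam": 0.75, "sgd": 0.25},
}

def sample_configuration(space, rng=random):
    """Draw one configuration: uniform over lists, weighted over dicts."""
    config = {}
    for name, values in space.items():
        if isinstance(values, dict):
            # Keys are the candidate values; dict values are relative weights
            choices, weights = zip(*values.items())
            config[name] = rng.choices(choices, weights=weights, k=1)[0]
        else:
            config[name] = rng.choice(values)
    return config

config = sample_configuration(search_space)
```

In a grid search the weights would simply be ignored and every key of the dictionary enumerated, which is why the two container types can be mixed freely.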