Consider the feature space “F” (11 features) to be the following:
Flow IAT Min
Avg Packet Size
Subflow F.Bytes
Flow Duration
Total Len F.Packets
Active Min
Active Mean
Init Win F.Bytes
PSH Flag Count
SYN Flag Count
ACK Flag Count
In case you are not able to find all these features, consider only the common features.
Train a vanilla binary decision tree on the dataset (mentioned in 1) using only the features listed above. Note the accuracy, recall, precision, and F1 score, where a true positive (TP) is a correctly classified malicious flow.
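A minimal sketch of this training step, assuming scikit-learn is used. The data is a synthetic stand-in generated with `make_classification`, not the real flow dataset, and the label encoding (1 = malicious, 0 = benign) is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder: 11 columns stand in for the 11 features of "F".
# Assumed label encoding: 1 = malicious, 0 = benign.
X, y = make_classification(n_samples=2000, n_features=11, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# "Vanilla" tree: default hyperparameters, fixed seed for repeatability.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# pos_label=1 makes "malicious" the positive class, so TP counts
# correctly classified malicious flows.
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred, pos_label=1),
    "precision": precision_score(y_test, y_pred, pos_label=1),
    "f1": f1_score(y_test, y_pred, pos_label=1),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With the real dataset, `X` would hold the selected feature columns and `y` the benign/malicious labels.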
Find the minimal feature subset F' (at most 4 features) of “F” that gives almost the same accuracy, recall, precision, and F1 score.
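One possible way to search for such a subset is brute force: with 11 features there are only 561 subsets of size 1 to 4. The sketch below uses synthetic placeholder data and F1 alone as a proxy for "almost the same" performance. The 0.02 tolerance is an assumed threshold, and the same check could be repeated for accuracy, recall, and precision.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

FEATURES = [
    "Flow IAT Min", "Avg Packet Size", "Subflow F.Bytes", "Flow Duration",
    "Total Len F.Packets", "Active Min", "Active Mean", "Init Win F.Bytes",
    "PSH Flag Count", "SYN Flag Count", "ACK Flag Count",
]

# Synthetic placeholder data; column i only stands in for FEATURES[i].
X, y = make_classification(
    n_samples=1000, n_features=len(FEATURES), n_informative=4, n_redundant=2, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)


def f1_for(columns):
    """Train a tree on the given column indices and return validation F1."""
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr[:, columns], y_tr)
    return f1_score(y_val, clf.predict(X_val[:, columns]), pos_label=1)


baseline = f1_for(list(range(len(FEATURES))))
TOLERANCE = 0.02  # assumed meaning of "almost the same"

best = None
for k in range(1, 5):  # try the smallest subsets first
    top_f1, top_cols = max((f1_for(list(c)), c) for c in combinations(range(len(FEATURES)), k))
    if top_f1 >= baseline - TOLERANCE:
        best = (top_f1, top_cols)
        break

print(f"baseline F1 (all 11 features): {baseline:.3f}")
if best is None:
    print("no subset of at most 4 features is within tolerance")
else:
    print(f"subset F1: {best[0]:.3f}", [FEATURES[i] for i in best[1]])
```

Because the subset is chosen on the validation split, the final numbers are better reported on a separate held-out split.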
Capture the decision tree rules for only the malicious flows (R) and make a note of them.
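The rules for the malicious class can be read off a fitted scikit-learn tree by walking its `tree_` structure and keeping only the root-to-leaf paths that end in a malicious leaf. This is a sketch under assumptions: the four feature names are a hypothetical reduced subset, the data is synthetic, and label 1 is taken to mean malicious.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical reduced feature subset, for illustration only.
FEATURES = ["Flow Duration", "Avg Packet Size", "SYN Flag Count", "ACK Flag Count"]
MALICIOUS = 1  # assumed label encoding

X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)


def malicious_rules(clf, feature_names, target=MALICIOUS):
    """Return one 'cond AND cond ...' string per leaf that predicts `target`."""
    tree = clf.tree_
    target_idx = list(clf.classes_).index(target)
    rules = []

    def walk(node, conditions):
        # In scikit-learn, a leaf has both child pointers set to -1.
        if tree.children_left[node] == tree.children_right[node]:
            if tree.value[node][0].argmax() == target_idx:
                rules.append(" AND ".join(conditions) or "TRUE")
            return
        name = feature_names[tree.feature[node]]
        thr = tree.threshold[node]
        # Samples with feature <= threshold go left, the rest go right.
        walk(tree.children_left[node], conditions + [f"{name} <= {thr:.4f}"])
        walk(tree.children_right[node], conditions + [f"{name} > {thr:.4f}"])

    walk(0, [])
    return rules


rules = malicious_rules(clf, FEATURES)
for i, rule in enumerate(rules, start=1):
    print(f"R{i}: {rule}")
```

`sklearn.tree.export_text` prints the whole tree in a similar form, but it includes benign leaves as well.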
System Analysis Phase
Let me know when you get to this point.
We then need to test the current system with this new dataset and the new decision tree rules. The code (in Python 3) already exists; we just need to update it with the new rules. Run the current system to get the recall, precision, accuracy, and F1 score for different numbers of hash table entries.
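A small harness for this sweep could look like the sketch below. It does not model the existing system: `run_system` is a hypothetical callable that takes a number of hash table entries and returns one predicted label per flow, and `fake_run_system` is a toy stub used only to make the example runnable.

```python
def confusion_metrics(y_true, y_pred, positive=1):
    """Accuracy, recall, precision, and F1 with `positive` (malicious) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


def sweep(table_sizes, run_system, y_true):
    """Evaluate the system once per hash table size."""
    return {n: confusion_metrics(y_true, run_system(n)) for n in table_sizes}


# Toy ground truth: 1 = malicious, 0 = benign (assumed encoding).
y_true = [1, 1, 0, 0, 1, 0]


def fake_run_system(n_entries):
    # Hypothetical stub: the first n_entries flows are labeled correctly,
    # every other flow is reported as benign.
    return [label if i < n_entries else 0 for i, label in enumerate(y_true)]


results = sweep([2, 4, 6], fake_run_system, y_true)
for n, m in results.items():
    print(n, {k: round(v, 3) for k, v in m.items()})
```

In practice `fake_run_system` would be replaced by a call into the existing Python 3 code with the updated rules.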
In case of very poor performance, we will need to update the system design.
Once this is done, I can create a new issue accordingly.
Information Gather Phase
ML Phase
To download the dataset, visit the repository mentioned in the FlowLens paper.
Follow these steps:
Make sure that the trained decision tree also predicts class probabilities (see https://stats.stackexchange.com/questions/193424/is-decision-tree-output-a-prediction-or-class-probabilities). Note the probabilities along with the class labels.
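With scikit-learn, class probabilities come from `predict_proba`, which returns the fraction of training samples of each class in the leaf a sample lands in. The sketch below assumes synthetic data and label 1 = malicious.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data; assumed encoding 1 = malicious, 0 = benign.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# A depth limit keeps leaves impure, so probabilities are not all 0 or 1.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

sample = X[:5]
labels = clf.predict(sample)
proba = clf.predict_proba(sample)  # shape (n_samples, n_classes), columns follow clf.classes_
malicious_col = list(clf.classes_).index(1)

for label, row in zip(labels, proba):
    print(f"label={label}  P(malicious)={row[malicious_col]:.3f}")
```

A fully grown tree tends to have pure leaves, in which case most probabilities are exactly 0 or 1. Limiting `max_depth` or setting `min_samples_leaf` gives more graded values.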