This README provides an overview of our project, which aims to build a classifier that labels network hosts as human, bot, or a combination of both. It outlines the project's objectives, the datasets used, the available commands, and the project's structure.
The project has been tested successfully on a Linux machine with Python 3.8+.
The goal is to build a classifier that labels each host from the dataset as either: 1) human; 2) bot; 3) human+bot;
and to report our methodology, results and analysis.
We are free to choose the classification method, but we need to justify our choice.
As specified in the project statement, we tried to follow the PEP 8 style guide as much as possible.
⚠️⚠️⚠️ Detailed information, methodology, results and analysis can be found in the project report.
The project focuses only on data from a public resolver with the following details:

- IP address: `1.1.1.1` (`one.one.one.one`)
- Port: `53`
The training data includes two datasets:

- `webclients_tcpdump.txt`: contains DNS traces from 120 UNamur hosts browsing various top-1000 Alexa-listed websites.
- `bots_tcpdump.txt`: contains DNS traces from 120 bots, also interacting with top-1000 Alexa-listed websites.

The two evaluation datasets both contain human and bot traffic:

- `eval1_tcpdump.txt`: bots and humans are guaranteed to be separate sets of hosts.
- `eval2_tcpdump.txt`: some hosts emit traffic from both a human and a bot.

For evaluation, two lists of known bot hosts are used:

- `eval1_botlist.txt`
- `eval2_botlist.txt`
When running the scripts, we use colors to differentiate the different steps of the process:

- Green: pre-processing steps.
- Blue: training phase.
- Red: evaluation phase.
- Yellow: saving and loading the model.
- Purple: everything related to classification and accuracy.

We also use light colors to differentiate the different rates during the classification.
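As an illustration of this color scheme, the snippet below is a hedged sketch of how the ANSI constants in `scripts/utils/colors.py` could look; the names `GREEN`, `colored`, etc. are assumptions for the example, not the file's actual contents:

```python
# Hedged sketch of colored step markers (illustrative only;
# the constant names are assumptions, not the real colors.py).

GREEN = "\033[92m"   # pre-processing steps
BLUE = "\033[94m"    # training phase
RED = "\033[91m"     # evaluation phase
YELLOW = "\033[93m"  # saving/loading the model
PURPLE = "\033[95m"  # classification and accuracy output
RESET = "\033[0m"    # back to the default terminal color


def colored(text: str, color: str) -> str:
    """Wrap a message in an ANSI color code."""
    return f"{color}{text}{RESET}"


if __name__ == "__main__":
    print(colored("Pre-processing the DNS traces...", GREEN))
    print(colored("Training the model...", BLUE))
```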
Do not forget to install the requirements before running the scripts!

First, create a virtual environment and activate it:

```bash
python3 -m venv .venv && source .venv/bin/activate
```

Finally, install the requirements:

```bash
pip3 install -r requirements.txt
```
First of all, navigate to the `scripts` folder:

```bash
cd scripts
```
In the next commands, `<algo>` can be replaced by `decision_tree`, `logistic_regression`, `neural_networks`, `random_forest` or `knn`. It is not a mandatory argument for `train.py`: if you do not specify it, the default algorithm is `logistic_regression`. We made it mandatory for `main.py` because this script is not required by the project statement.
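To make that behaviour concrete, the `--algo` option could be wired up with `argparse` roughly as below. This is a hedged sketch of the idea (optional with a `logistic_regression` default for `train.py`, required for `main.py`), not the scripts' actual code; the helper name `build_parser` is an assumption:

```python
# Hedged sketch of the --algo argument handling (illustrative only).
import argparse

ALGOS = ["decision_tree", "logistic_regression", "neural_networks",
         "random_forest", "knn"]


def build_parser(algo_required: bool) -> argparse.ArgumentParser:
    """algo_required=False mimics train.py (defaults to logistic_regression),
    algo_required=True mimics main.py (the algorithm must be given explicitly)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--webclients", required=True)
    parser.add_argument("--bots", required=True)
    parser.add_argument("--output", required=True)
    if algo_required:
        parser.add_argument("--algo", choices=ALGOS, required=True)
    else:
        parser.add_argument("--algo", choices=ALGOS, default="logistic_regression")
    return parser


if __name__ == "__main__":
    args = build_parser(algo_required=False).parse_args()
    print(f"Selected algorithm: {args.algo}")
```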
To train the model, run the following command:

```bash
python3 train.py \
    --webclients ../training_datasets/tcpdumps/webclients_tcpdump.txt \
    --bots ../training_datasets/tcpdumps/bots_tcpdump.txt \
    --algo <algo> \
    --output ../trained_models/<algo>/trained_model_<algo>.pkl
```

For example, you could use:

```bash
python3 train.py \
    --webclients ../training_datasets/tcpdumps/webclients_tcpdump.txt \
    --bots ../training_datasets/tcpdumps/bots_tcpdump.txt \
    --algo logistic_regression \
    --output ../trained_models/logistic_regression/trained_model_logistic_regression.pkl
```
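The training command writes the fitted model to the `.pkl` file given by `--output`. If you want to inspect that file outside the provided scripts, a plain pickle load along these lines should work (this is a hedged sketch assuming the file contains a single pickled estimator; adapt it if the scripts store something else):

```python
# Hedged sketch: load and inspect a model saved by train.py.
# Assumes the .pkl file holds one pickled (e.g. scikit-learn) estimator.
import pickle

MODEL_PATH = "../trained_models/logistic_regression/trained_model_logistic_regression.pkl"

with open(MODEL_PATH, "rb") as f:
    model = pickle.load(f)

print(type(model))  # e.g. a LogisticRegression instance
```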
To evaluate the model, run the following command:

```bash
python3 eval.py \
    --trained_model ../trained_models/<algo>/trained_model_<algo>.pkl \
    --dataset ../evaluation_datasets/tcpdumps/eval1_tcpdump.txt \
    --output ../suspicious_hosts/suspicious_hosts.txt
```

For example, you could use:

```bash
python3 eval.py \
    --trained_model ../trained_models/logistic_regression/trained_model_logistic_regression.pkl \
    --dataset ../evaluation_datasets/tcpdumps/eval1_tcpdump.txt \
    --output ../suspicious_hosts/suspicious_hosts.txt
```
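After evaluation, `suspicious_hosts.txt` can be compared against the provided botlist for a quick sanity check. The snippet below is a hedged sketch, not part of the project scripts; it assumes both files contain one host identifier per line, and the botlist path shown is an assumption:

```python
# Hedged sketch: compare eval.py output with the known botlist.
# Assumes one host identifier per line in both files; paths are assumptions.

def read_hosts(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

suspicious = read_hosts("../suspicious_hosts/suspicious_hosts.txt")
known_bots = read_hosts("../evaluation_datasets/eval1_botlist.txt")

detected = suspicious & known_bots
missed = known_bots - suspicious
print(f"Detected bots: {len(detected)} / {len(known_bots)}")
print(f"Missed bots:   {len(missed)}")
```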
To do both at the same time, run the following command:

```bash
python3 main.py \
    --webclients ../training_datasets/tcpdumps/webclients_tcpdump.txt \
    --bots ../training_datasets/tcpdumps/bots_tcpdump.txt \
    --algo <algo> \
    --trained_model ../trained_models/<algo>/trained_model_<algo>.pkl \
    --dataset ../evaluation_datasets/tcpdumps/eval1_tcpdump.txt \
    --output ../suspicious_hosts/suspicious_hosts.txt
```

For example, you could use:

```bash
python3 main.py \
    --webclients ../training_datasets/tcpdumps/webclients_tcpdump.txt \
    --bots ../training_datasets/tcpdumps/bots_tcpdump.txt \
    --algo logistic_regression \
    --trained_model ../trained_models/logistic_regression/trained_model_logistic_regression.pkl \
    --dataset ../evaluation_datasets/tcpdumps/eval1_tcpdump.txt \
    --output ../suspicious_hosts/suspicious_hosts.txt
```
First, starting from the root directory of the project, navigate to the `scripts/utils/diagrams/` directory:

```bash
cd scripts/utils/diagrams/
```
To create plots related to the algorithms, run the following command:

```bash
python3 diagrams_algo.py
```

To create plots related to the metrics, run the following command:

```bash
python3 diagrams_metrics.py
```
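As a purely illustrative sketch of the kind of figure these scripts produce, a per-algorithm bar chart could be built with matplotlib as below. The values are placeholders (zeros), NOT the project's results, and this is not the actual code of `diagrams_algo.py`:

```python
# Purely illustrative sketch of a per-algorithm accuracy bar chart.
# Placeholder values only -- NOT the project's measured results.
import matplotlib.pyplot as plt

algos = ["decision_tree", "logistic_regression", "neural_networks",
         "random_forest", "knn"]
accuracies = [0.0] * len(algos)  # replace with the real measured accuracies

plt.bar(algos, accuracies)
plt.ylabel("Accuracy")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.savefig("accuracy_per_algo.png")
```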
- `diagrams`: contains the diagrams produced for the report.
- `evaluation_datasets`: contains the evaluation datasets given by the professor and the botlists.
- `report`: contains the report of the project.
- `results`: contains several files with the results of the different algorithms/features (not all of them, of course).
- `scripts`: contains the scripts used to train and evaluate the models.
  - `features`: contains the 3 scripts (time, misc and numbers) used to create the new features based on the aggregated raw features.
  - `utils`:
    - `colors.py`: contains the colors used in the scripts.
    - `constants.py`: contains the constants used in the scripts.
    - `features.py`: contains the functions used to orchestrate everything related to the features.
    - `parsing_dns_trace.py`: contains the functions used to parse the DNS traces.
    - `saving_and_loading.py`: contains the functions used to save and load the models.
    - `diagrams`:
      - `diagrams_algo.py`: contains the functions used to create the diagrams related to the algorithms.
      - `diagrams_metrics.py`: contains the functions used to create the diagrams related to the metrics.
  - `eval.py`: contains the functions used to evaluate the models.
  - `main.py`: contains the functions used to both train and evaluate the models.
  - `train.py`: contains the functions used to train the models.
- `suspicious_hosts`: contains the suspicious hosts found by the models.
- `trained_models`: contains the trained models; there is a subdirectory for each algorithm.
- `training_datasets`: contains the training datasets given by the professor.