maluyckx / IDS-ML

Flagging suspicious hosts from DNS traces using machine-learning. 20/20

ICYBM201 : Flagging suspicious hosts from DNS traces

Grade : 20/20

Authors and ULB matricules

Introduction

This README gives an overview of our project, which builds a classifier that labels each network host as human, bot, or a combination of both. It covers the project's objectives, the datasets used, the available commands and the project's structure.

The project has been tested successfully on a Linux machine with Python 3.8+.

Goal

Build a classifier that labels each host from the dataset as either 1) human, 2) bot or 3) human+bot, and report our methodology, results and analysis.

We are free to choose the classification method, but we must justify that choice.

As specified in the project statement, we followed the PEP 8 style guide as closely as possible.

Report

⚠️⚠️⚠️ Detailed information, methodology, results and analysis can be found in the project report.

Datasets

Public resolver data

The project focuses only on data from a public resolver with the following details :

Training datasets

The training data includes two datasets :

  1. webclients_tcpdump.txt : Contains DNS traces from 120 UNamur hosts browsing various top-1000 Alexa-listed websites.
  2. bots_tcpdump.txt : Contains DNS traces from 120 bots, also interacting with top-1000 Alexa-listed websites.
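
The README does not show what a line of these trace files looks like. As a rough illustration only, the sketch below assumes the standard tcpdump text output for DNS queries and groups the queried names by source host, which is the kind of per-host view a classifier would need. The regular expression, the `group_queries_by_host` helper and the sample hostnames are all hypothetical and are not taken from the project's code.

```python
import re
from collections import defaultdict

# Assumed tcpdump text layout for a DNS query (not confirmed by this README):
#   13:22:44.546969 IP host1.55771 > resolver.domain: 18712+ A? www.example.com. (33)
QUERY_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+) IP "
    r"(?P<src>\S+)\.(?P<sport>\d+) > \S+\.domain: "
    r"(?P<id>\d+)\+? (?P<qtype>[A-Z]+)\? (?P<qname>\S+) \(\d+\)$"
)


def group_queries_by_host(lines):
    """Return {source host: [(timestamp, query type, queried name), ...]}."""
    per_host = defaultdict(list)
    for line in lines:
        match = QUERY_RE.match(line.strip())
        if match is None:
            continue  # responses and lines in another layout are skipped
        per_host[match["src"]].append(
            (match["time"], match["qtype"], match["qname"].rstrip("."))
        )
    return dict(per_host)


# Hypothetical sample lines: two queries and one response.
sample = [
    "13:22:44.546969 IP host1.55771 > resolver.domain: 18712+ A? www.example.com. (33)",
    "13:22:44.580000 IP resolver.domain > host1.55771: 18712 1/0/0 A 93.184.216.34 (49)",
    "13:22:45.100000 IP host2.40112 > resolver.domain: 5021+ AAAA? example.org. (29)",
]
queries = group_queries_by_host(sample)
```

Only query lines are kept here. A real feature extractor would probably also use the responses, but that depends on the methodology described in the report.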

Evaluation datasets

Both evaluation datasets contain human and bot traffic.

  1. eval1_tcpdump.txt : bots and humans are guaranteed to be separate sets of hosts.
  2. eval2_tcpdump.txt : some hosts emit traffic from a human and from a bot.

Lists of bots

For evaluation, two lists of known bot hosts are used :

Colors

When running the scripts, we use colors to distinguish the steps of the process :

Commands

Do not forget to install the requirements before running the scripts ! First, create a virtual environment and activate it :

python3 -m venv .venv && source .venv/bin/activate

Then, install the requirements :

pip3 install -r requirements.txt

Next, navigate to the scripts folder :

cd scripts

In the commands below, <algo> can be replaced by decision_tree, logistic_regression, neural_networks, random_forest or knn. The --algo argument is optional for train.py and defaults to logistic_regression when omitted. We made it mandatory for main.py because this script is not required by the project statement.
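
To make the optional --algo behaviour concrete, here is a minimal argparse sketch of how a parser for train.py could be declared. The flag names, the list of algorithms and the default come from this README. Everything else, including the `build_train_parser` helper and the choice to make the other flags required, is an assumption and may differ from the actual script.

```python
import argparse

# Algorithm names as listed in this README.
ALGORITHMS = [
    "decision_tree",
    "logistic_regression",
    "neural_networks",
    "random_forest",
    "knn",
]


def build_train_parser():
    """Hypothetical parser mirroring the train.py flags shown in this README."""
    parser = argparse.ArgumentParser(prog="train.py")
    # Assumption: the dataset and output paths are required.
    parser.add_argument("--webclients", required=True)
    parser.add_argument("--bots", required=True)
    parser.add_argument("--output", required=True)
    # Optional, falls back to logistic_regression when omitted.
    parser.add_argument("--algo", choices=ALGORITHMS, default="logistic_regression")
    return parser


# Parsing without --algo picks up the default.
args = build_train_parser().parse_args([
    "--webclients", "../training_datasets/tcpdumps/webclients_tcpdump.txt",
    "--bots", "../training_datasets/tcpdumps/bots_tcpdump.txt",
    "--output", "../trained_models/logistic_regression/trained_model_logistic_regression.pkl",
])
```

Using `choices` makes argparse reject an unknown algorithm name with a usage error before any training starts.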


To train the model, run the following command :

python3 train.py \
--webclients ../training_datasets/tcpdumps/webclients_tcpdump.txt \
--bots ../training_datasets/tcpdumps/bots_tcpdump.txt \
--algo <algo> \
--output ../trained_models/<algo>/trained_model_<algo>.pkl

For example, you could use :

python3 train.py \
--webclients ../training_datasets/tcpdumps/webclients_tcpdump.txt \
--bots ../training_datasets/tcpdumps/bots_tcpdump.txt \
--algo logistic_regression \
--output ../trained_models/logistic_regression/trained_model_logistic_regression.pkl
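
The output paths above follow the pattern ../trained_models/<algo>/trained_model_<algo>.pkl, and the .pkl extension suggests that the model is serialized with pickle, although this README does not confirm it. The sketch below shows one way to build such a path and to save and reload an object, under that assumption. The helpers `model_path`, `save_model` and `load_model` are hypothetical, and a plain dictionary stands in for a trained model.

```python
import pickle
import tempfile
from pathlib import Path


def model_path(algo, root="../trained_models"):
    """Build <root>/<algo>/trained_model_<algo>.pkl, the pattern used in this README."""
    return Path(root) / algo / f"trained_model_{algo}.pkl"


def save_model(model, path):
    """Pickle the model, creating the per-algorithm folder if it is missing."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load a pickled model. Only unpickle files you created yourself."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Round trip in a temporary folder, with a dictionary standing in for a model.
with tempfile.TemporaryDirectory() as tmp:
    target = model_path("knn", root=tmp)
    save_model({"algo": "knn", "weights": [0.1, 0.2]}, target)
    restored = load_model(target)
```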

To evaluate the model, run the following command :

python3 eval.py \
--trained_model ../trained_models/<algo>/trained_model_<algo>.pkl \
--dataset ../evaluation_datasets/tcpdumps/eval1_tcpdump.txt \
--output ../suspicious_hosts/suspicious_hosts.txt

For example, you could use :

python3 eval.py \
--trained_model ../trained_models/logistic_regression/trained_model_logistic_regression.pkl \
--dataset ../evaluation_datasets/tcpdumps/eval1_tcpdump.txt \
--output ../suspicious_hosts/suspicious_hosts.txt
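
Once eval.py has written the suspicious hosts, they can be compared with a list of known bot hosts. The sketch below computes precision, recall and F1 from two sets of hostnames. It assumes one hostname per line in both files, which this README does not specify, and the `read_hosts` and `score` helpers as well as the hostnames are hypothetical. The metrics actually used in the report may differ.

```python
def read_hosts(lines):
    """Collect non-empty, stripped hostnames (assumed: one hostname per line)."""
    return {line.strip() for line in lines if line.strip()}


def score(flagged, known_bots):
    """Compare flagged hosts with the known bots and return basic metrics."""
    true_positives = len(flagged & known_bots)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(known_bots) if known_bots else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical contents of suspicious_hosts.txt and of a known-bot list.
flagged = read_hosts(["host1\n", "host2\n", "host5\n", "\n"])
known_bots = read_hosts(["host1\n", "host2\n", "host3\n", "host4\n"])
metrics = score(flagged, known_bots)
```

With real files, the same functions can be fed with `open(path)` instead of the in-memory lists.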

To train and evaluate in a single run, run the following command :

python3 main.py \
--webclients ../training_datasets/tcpdumps/webclients_tcpdump.txt \
--bots ../training_datasets/tcpdumps/bots_tcpdump.txt \
--algo <algo> \
--trained_model ../trained_models/<algo>/trained_model_<algo>.pkl \
--dataset ../evaluation_datasets/tcpdumps/eval1_tcpdump.txt \
--output ../suspicious_hosts/suspicious_hosts.txt 

For example, you could use :

python3 main.py \
--webclients ../training_datasets/tcpdumps/webclients_tcpdump.txt \
--bots ../training_datasets/tcpdumps/bots_tcpdump.txt \
--algo logistic_regression \
--trained_model ../trained_models/logistic_regression/trained_model_logistic_regression.pkl \
--dataset ../evaluation_datasets/tcpdumps/eval1_tcpdump.txt \
--output ../suspicious_hosts/suspicious_hosts.txt 

Diagrams

First, starting from the root directory of the project, navigate to the scripts/utils/diagrams/ directory :

cd scripts/utils/diagrams/

To create plots related to algorithms, run the following command :

python3 diagrams_algo.py

To create plots related to metrics, run the following command :

python3 diagrams_metrics.py

Structure of the project