Web3 Phishing Detection

An end-to-end pipeline for phishing message detection, encompassing model training and inference, along with a Flask application for user interaction, all containerized using Docker for easy deployment and execution. Built in 2 days.

Background

The rapid evolution of blockchain technology and the rise of decentralized applications (dApps) have given birth to the Web3 ecosystem. While this technology promises greater control and ownership over digital assets, it has also introduced new security challenges. One of the most significant concerns is the prevalence of phishing attacks targeting Web3 users with the intent of obtaining their seed phrases and subsequently draining their wallets for fraudulent activities.

Web3 phishing attacks are a critical issue in the modern blockchain landscape. These attacks involve the use of deceptive and fraudulent tactics to manipulate users into revealing their sensitive information, primarily their seed phrases. Seed phrases, also known as mnemonic phrases or recovery phrases, are sets of words that act as a cryptographic key to access and control users' wallets and assets within the Web3 ecosystem.

The core problem lies in the fact that users often lack awareness and knowledge about the security practices required to protect their seed phrases. Phishing attackers exploit this vulnerability by deploying convincing phishing messages through various communication channels, including email, social media, messaging apps, and even fake dApps. These messages typically prompt users to click on malicious links, enter their seed phrases on counterfeit websites, or share their confidential information.

Set Up

To run the Flask app:

  1. Install Docker
  2. In the terminal, run: docker run -p 8888:8888 vincenthml/ml-app:1.1
  3. Go to http://127.0.0.1:8888 or http://172.17.0.2:8888 in a web browser
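
Once the container is running, the app can also be called programmatically. The snippet below is only a rough sketch: the route (/) and the form field name (message) are assumptions about how the Flask form in app.py is wired, not details taken from the code.

```python
# Rough sketch only: the endpoint path and the form field name ("message")
# are assumptions about how the Flask form in app.py is wired.
import requests

sample = "Your wallet has been flagged! Verify your seed phrase at the link below."

response = requests.post(
    "http://127.0.0.1:8888/",     # port published by the docker run command above
    data={"message": sample},     # hypothetical form field name
    timeout=10,
)
print(response.status_code)
print(response.text)              # rendered HTML containing the phishing verdict
```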

To run the code in this repo:

  1. Clone this repo and cd to the root directory
  2. Create a new virtual environment and activate it (venv is recommended)
  3. Install poetry by running: pip install poetry
  4. Install the Python packages with poetry by running: poetry install

Now you can do the following:

  1. Download models from HuggingFace, train them on the phishing dataset, and save the new model and tokenizers
  2. Evaluate the newly trained model on F1, recall, precision, and accuracy
  3. Test the model predictions in the CLI (a rough sketch follows this list)
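
As an illustration of the CLI prediction step, the sketch below loads a saved model and tokenizer and classifies a single message. The model directory and the label mapping (1 = phishing) are assumptions; the actual logic lives in ml_dev/test_predictions.py and may differ.

```python
# Minimal sketch of a CLI prediction check, in the spirit of
# ml_dev/test_predictions.py. The model directory and the label mapping
# (1 = phishing) are assumptions, not details taken from the repo.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_DIR = "app/models"  # assumed location of the saved model and tokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

message = input("Enter a message to classify: ")
inputs = tokenizer(message, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

label = logits.argmax(dim=-1).item()
print("Phishing" if label == 1 else "Not phishing")
```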

To create a new docker image so the Flask app can run an improved model:

  1. Make sure the best-performing model and associated tokenizer files are saved in app/models, and remove all other models
  2. cd app
  3. Update the image parameter in docker-compose.yaml to the correct username and project name
  4. Build the docker image: docker compose build
  5. Push the image to Docker Hub for reproducibility: docker compose push

Deliverables

Assumptions

Constraints

Due to the limited time available (1.5 days), the following constraints were applied:

The following MLOps best practices will not be applied:

Roadmap

Model Selection

distilbert-base-uncased-finetuned-sst-2-english was chosen as the base model and further fine-tuned on a custom Web3 phishing dataset, using an 80/20 train/test split.
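
For reference, a minimal sketch of such a fine-tuning setup is shown below. It assumes the CSV has text and label columns (column names are assumptions) and uses the Hugging Face Trainer API; the actual training logic lives in ml_dev/train_and_save_model.py and may differ.

```python
# Minimal fine-tuning sketch, assuming a CSV with "text" and "label" columns
# (column names are assumptions); the real logic is in
# ml_dev/train_and_save_model.py.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=2)

# 80/20 train/test split, as described above
raw = load_dataset("csv", data_files="ml_dev/data/DS test_data_deduped.csv")
splits = raw["train"].train_test_split(test_size=0.2, seed=42)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

splits = splits.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ml_dev/model_outputs", num_train_epochs=3),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
)
trainer.train()
trainer.save_model("ml_dev/model_outputs/best_model")
tokenizer.save_pretrained("ml_dev/model_outputs/best_model")
```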

The original dataset contained 30 duplicate pairs which were removed before training.
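
The deduplication step is simple enough to sketch with pandas (the exact preprocessing in the repo may differ):

```python
# Sketch of the deduplication step; the actual preprocessing may differ.
import pandas as pd

df = pd.read_csv("ml_dev/data/DS test_data.csv")
deduped = df.drop_duplicates()  # drops one row from each duplicate pair
deduped.to_csv("ml_dev/data/DS test_data_deduped.csv", index=False)
print(f"Removed {len(df) - len(deduped)} duplicate rows")
```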

Metric results

The model trained on the deduplicated phishing data achieved the following metrics (the metrics of the model trained on the dataset with duplicates are shown in parentheses):
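
The four metrics can be computed with scikit-learn roughly as sketched below; the actual evaluation lives in ml_dev/generate_model_metrics.py, and the inputs here are dummy values for illustration.

```python
# Sketch of the evaluation metrics, in the spirit of
# ml_dev/generate_model_metrics.py; the inputs here are dummy values.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(y_true, y_pred):
    """Return the four metrics reported for the fine-tuned model."""
    return {
        "f1": f1_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
    }

print(compute_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))
```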

Repo Structure

.
├── LICENSE
├── README.md
├── app                                       # Directory for Flask app deployment
│   ├── Dockerfile
│   ├── app.py
│   ├── docker-compose.yaml
│   ├── models
│   │   ├── pytorch_model.bin
│   │   ├── special_tokens_map.json
│   │   ├── tokenizer.json
│   │   ├── tokenizer_config.json
│   │   └── vocab.txt
│   ├── requirements.txt
│   ├── templates
│   │   ├── index.html
│   │   └── show.html
│   └── utils.py
├── ml_dev                                    # Directory for model development
│   ├── config.yaml
│   ├── data
│   │   ├── DS test_data.csv
│   │   └── DS test_data_deduped.csv
│   ├── generate_model_metrics.py
│   ├── logs
│   ├── model_outputs
│   ├── notebooks
│   │   └── exploration.ipynb
│   ├── requirements_poetry.txt
│   ├── test_predictions.py
│   ├── train_and_save_model.py
│   └── utilities.py
├── poetry.lock
├── pyproject.toml
└── tests
    └── ml_dev
        └── test_generate_model_metrics.py

Future Work

With more time, the following features could be explored or implemented:

License

This GitHub repository is licensed under the MIT License.