This is the repository of the paper "Few-Sample Named Entity Recognition for Security Vulnerability Reports by Fine-Tuning Pre-Trained Language Models". In this paper, we design a system that leverages pre-trained language models (PLMs) to identify vulnerable software names and versions in public vulnerability reports.
In the sample vulnerability report shown below, our system tags the relevant tokens as `SN` and `SV` (vulnerable software names and versions) and all other tokens as `O` (outside).
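The tagging scheme can be illustrated with a small sketch (the sentence, token boundaries, and helper function below are illustrative, not taken from the repository):

```python
# Hypothetical example of the SN/SV/O tagging scheme applied to a
# vulnerability-report sentence.
tokens = ["Buffer", "overflow", "in", "OpenSSL", "1.0.1", "allows", "attackers"]
labels = ["O", "O", "O", "SN", "SV", "O", "O"]

def extract_entities(tokens, labels):
    """Collect the tokens tagged as software names (SN) or versions (SV)."""
    return {
        "SN": [t for t, l in zip(tokens, labels) if l == "SN"],
        "SV": [t for t, l in zip(tokens, labels) if l == "SV"],
    }

print(extract_entities(tokens, labels))  # {'SN': ['OpenSSL'], 'SV': ['1.0.1']}
```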
Follow the steps below to make sure the code is runnable.
Step 1: Create a `conda` environment and install the related libraries. Make sure the versions of Python, CUDA, cuDNN, and PyTorch are compatible with each other.
Note that `wandb` is required to log the training process and metrics. Once an account is set up, the information specified by the user will be synchronized with an online interactive dashboard (see here to get started).
conda create --name fewvul python=3.7
conda activate fewvul
# generic dependencies
conda install pandas==1.0.1 numpy==1.18.1 scikit-learn==0.23.1
# logging
pip install wandb
# hyperparameter search
pip install optuna
# transformers dependencies
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch
pip install transformers==4.5.1
pip install seqeval
pip install datasets
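A minimal sketch of a compatibility check against the pins above (the `required` mapping restates the install commands; the `installed` dictionary is a made-up example, not a probe of your actual environment):

```python
# Versions pinned by the install commands above; extend as needed.
required = {
    "torch": "1.7.1",
    "transformers": "4.5.1",
    "pandas": "1.0.1",
}

def check_versions(installed, required):
    """Return the packages whose installed version differs from the pin."""
    return {
        name: (installed.get(name), pin)
        for name, pin in required.items()
        if installed.get(name) != pin
    }

# Example: one mismatched package is reported with (found, expected).
installed = {"torch": "1.7.1", "transformers": "4.6.0", "pandas": "1.0.1"}
print(check_versions(installed, required))  # {'transformers': ('4.6.0', '4.5.1')}
```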
Step 2: Install the code base in the `conda` environment. This permanently resolves the relative-import issues between the different submodules (see here for a quick reference). After installation, you will see `fewvul.egg-info` in the project directory.
pip install -e .
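The editable install assumes a setup script at the project root; a minimal sketch of what such a script might contain (the repository's actual `setup.py` may differ, and the version string here is a placeholder):

```python
# setup.py -- minimal sketch; the repository's actual file may differ.
from setuptools import setup, find_packages

setup(
    name="fewvul",
    version="0.1.0",           # placeholder version
    packages=find_packages(),  # picks up the submodules so absolute imports work
)
```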
The data used for training is downloaded from here. It is provided in our repository (see `dataset/ner_data`), so no additional preparation is required to run our experiments.
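The exact data layout is not documented here; the sketch below assumes a CoNLL-style format (one "token label" pair per line, blank line between sentences), which is common for NER datasets but is an assumption about this repository:

```python
def read_conll(text):
    """Parse CoNLL-style text into (tokens, labels) pairs, one per sentence."""
    sentences, tokens, labels = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # blank line ends the current sentence
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        token, label = line.split()
        tokens.append(token)
        labels.append(label)
    if tokens:  # flush the last sentence
        sentences.append((tokens, labels))
    return sentences

sample = "OpenSSL SN\n1.0.1 SV\ncrashes O\n\nFixed O"
print(read_conll(sample))
# [(['OpenSSL', '1.0.1', 'crashes'], ['SN', 'SV', 'O']), (['Fixed'], ['O'])]
```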
All of the global variables are defined in `setting/setting.py`. The following variables need to be set correctly to run the experiments.
Step 1: Set the path the entire project resides in.
base_path = pathlib.Path("/path/to/base/directory")
Step 2: The PLMs the repository currently supports are BERT, RoBERTa, and Electra. Choose one of the following based on your hardware (`bert-base-cased` by default).
# BERT
model_name = "bert-base-cased"
# model_name = "bert-large-cased"
# RoBERTa
# model_name = "roberta-base"
# model_name = "roberta-large"
# Electra
# model_name = "google/electra-base-discriminator"
# model_name = "google/electra-large-discriminator"
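These two globals together determine where model files live. A small sketch of how they combine into the checkpoint location used in the later sections (the example `base_path` is a placeholder, as above):

```python
import pathlib

# Placeholder values, matching the configuration examples above.
base_path = pathlib.Path("/path/to/base/directory")
model_name = "bert-base-cased"

# Fine-tuned checkpoints end up under <base>/pretrained/<model_name>/checkpoints.
checkpoint_dir = base_path / "pretrained" / model_name / "checkpoints"
print(checkpoint_dir)  # e.g. /path/to/base/directory/pretrained/bert-base-cased/checkpoints
```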
The transfer learning experiments depend on the fine-tuning experiments. Make sure the following steps are run before the transfer learning experiments.
Step 1: Download the PLM checkpoint from the HuggingFace Model Hub. This creates a `pretrained` folder in the base directory and stores the PLM locally for later use. The global variables specified in the previous section ensure the correct model is downloaded.
python download.py
Step 2: Fine-tune the downloaded checkpoint on the `memc` category. Running the following script generates 10 checkpoints, each fine-tuned on 10% of the data sampled from the `memc` training split.
cd fine_tuning
python run.py
All of the checkpoints will be saved in `<base>/pretrained/<model_name>/checkpoints`.
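One way the 10 checkpoints can arise is from 10 independent 10% samples of the training split, each with its own seed; this is a sketch of that idea, not the repository's exact sampling code:

```python
import random

def sample_fraction(data, fraction, seed):
    """Draw a reproducible random subset covering `fraction` of the data."""
    rng = random.Random(seed)
    k = max(1, int(len(data) * fraction))
    return rng.sample(data, k)

train = list(range(100))  # stand-in for the memc training split
subsets = [sample_fraction(train, 0.1, seed) for seed in range(10)]

print(len(subsets))                         # 10 samples -> 10 checkpoints
print(all(len(s) == 10 for s in subsets))   # each covers 10% of the data
```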
Make sure you have already obtained the fine-tuned checkpoints stored in `<base>/pretrained/<model_name>/checkpoints` before running the transfer learning experiments.
cd transfer_learning
# on aggregate of 12 categories
python run_tl_agg.py
# on single category
python run_tl_single.py
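The difference between the two scripts is the evaluation scope: the aggregate run pools all categories, while the single-category run scores each one separately. A hedged sketch of aggregate scoring via micro-averaged F1 (the metric choice, category names, and counts below are illustrative assumptions, not taken from the repository):

```python
def aggregate_f1(per_category):
    """Micro-average: pool true/false positives and false negatives, then F1."""
    tp = sum(c["tp"] for c in per_category.values())
    fp = sum(c["fp"] for c in per_category.values())
    fn = sum(c["fn"] for c in per_category.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up per-category counts for illustration only.
categories = {
    "memc": {"tp": 8, "fp": 2, "fn": 2},
    "other": {"tp": 6, "fp": 4, "fn": 2},
}
print(round(aggregate_f1(categories), 3))  # 0.737
```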