This is the repository of the paper "Few-Sample Named Entity Recognition for Security Vulnerability Reports by Fine-Tuning Pre-Trained Language Models". In this paper, we design a system that leverages pre-trained language models (PLMs) to identify vulnerable software names and versions in public vulnerability reports.
In the sample vulnerability report shown below, our system tags the relevant tokens as `SN` and `SV` (vulnerable software names and versions) and all other tokens as `O` (outside).
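The tagging scheme can be illustrated with a small sketch (the sentence, token boundaries, and helper function below are illustrative, not taken from the repository):

```python
# Hypothetical example of the SN/SV/O tagging scheme applied to a
# vulnerability-report sentence.
tokens = ["Buffer", "overflow", "in", "OpenSSL", "1.0.1", "allows", "attackers"]
labels = ["O", "O", "O", "SN", "SV", "O", "O"]

def extract_entities(tokens, labels):
    """Collect the tokens tagged as software names (SN) or versions (SV)."""
    return {
        "SN": [t for t, l in zip(tokens, labels) if l == "SN"],
        "SV": [t for t, l in zip(tokens, labels) if l == "SV"],
    }

print(extract_entities(tokens, labels))  # {'SN': ['OpenSSL'], 'SV': ['1.0.1']}
```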
Follow the steps below to make sure the code is runnable.
Step 1: Create a `conda` environment and install the related libraries. Make sure the versions of Python, CUDA, cuDNN, and PyTorch are compatible with each other.
Note that `wandb` is required to log the training process and metrics. Once an account is set up, the information specified by the user will be synchronized with an online interactive dashboard (see here to get started).
conda create --name fewvul python=3.7
conda activate fewvul
# generic dependencies
conda install pandas==1.0.1 numpy==1.18.1 scikit-learn==0.23.1
# logging
pip install wandb
# hyperparameter search
pip install optuna
# transformers dependencies
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch
pip install transformers==4.5.1
pip install seqeval
pip install datasets
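A minimal sketch of a compatibility check against the pins above (the `required` mapping restates the install commands; the `installed` dictionary is a made-up example, not a probe of your actual environment):

```python
# Versions pinned by the install commands above; extend as needed.
required = {
    "torch": "1.7.1",
    "transformers": "4.5.1",
    "pandas": "1.0.1",
}

def check_versions(installed, required):
    """Return the packages whose installed version differs from the pin."""
    return {
        name: (installed.get(name), pin)
        for name, pin in required.items()
        if installed.get(name) != pin
    }

# Example: one mismatched package is reported with (found, expected).
installed = {"torch": "1.7.1", "transformers": "4.6.0", "pandas": "1.0.1"}
print(check_versions(installed, required))  # {'transformers': ('4.6.0', '4.5.1')}
```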
Step 2: Install the code base in the `conda` environment. This permanently resolves the relative-import issues between the different submodules (see here for a quick reference). After installation, you will see `fewvul.egg-info` in the project directory.
pip install -e .
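The editable install assumes a setup script at the project root; a minimal sketch of what such a script might contain (the repository's actual `setup.py` may differ, and the version string here is a placeholder):

```python
# setup.py -- minimal sketch; the repository's actual file may differ.
from setuptools import setup, find_packages

setup(
    name="fewvul",
    version="0.1.0",           # placeholder version
    packages=find_packages(),  # picks up the submodules so absolute imports work
)
```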
The data used for training is downloaded from here. It is provided in our repository (see `dataset/ner_data`), so no additional preparation is required to run our experiments.
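The exact data layout is not documented here; the sketch below assumes a CoNLL-style format (one "token label" pair per line, blank line between sentences), which is common for NER datasets but is an assumption about this repository:

```python
def read_conll(text):
    """Parse CoNLL-style text into (tokens, labels) pairs, one per sentence."""
    sentences, tokens, labels = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # blank line ends the current sentence
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        token, label = line.split()
        tokens.append(token)
        labels.append(label)
    if tokens:  # flush the last sentence
        sentences.append((tokens, labels))
    return sentences

sample = "OpenSSL SN\n1.0.1 SV\ncrashes O\n\nFixed O"
print(read_conll(sample))
# [(['OpenSSL', '1.0.1', 'crashes'], ['SN', 'SV', 'O']), (['Fixed'], ['O'])]
```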
All of the global variables are defined in `setting/setting.py`. The following variables need to be set correctly to run the experiments.
Step 1: Set the path the entire project resides in.
base_path = pathlib.Path("/path/to/base/directory")
Step 2: The PLMs the repository currently supports are BERT, RoBERTa, and Electra. Choose one of the following based on your hardware (`bert-base-cased` by default).
# BERT
model_name = "bert-base-cased"
# model_name = "bert-large-cased"
# RoBERTa
# model_name = "roberta-base"
# model_name = "roberta-large"
# Electra
# model_name = "google/electra-base-discriminator"
# model_name = "google/electra-large-discriminator"
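These two globals together determine where model files live. A small sketch of how they combine into the checkpoint location used in the later sections (the example `base_path` is a placeholder, as above):

```python
import pathlib

# Placeholder values, matching the configuration examples above.
base_path = pathlib.Path("/path/to/base/directory")
model_name = "bert-base-cased"

# Fine-tuned checkpoints end up under <base>/pretrained/<model_name>/checkpoints.
checkpoint_dir = base_path / "pretrained" / model_name / "checkpoints"
print(checkpoint_dir)  # e.g. /path/to/base/directory/pretrained/bert-base-cased/checkpoints
```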
The transfer learning experiments depend on the fine-tuning experiments. Make sure the following steps are run before the transfer learning experiments.
Step 1: Download the PLM checkpoint from the HuggingFace Model Hub. This creates a `pretrained` folder in the base directory and stores the PLM locally for later use. The global variables specified in the previous section ensure the correct model is downloaded.
python download.py
Step 2: Fine-tune the downloaded checkpoint on the `memc` category. Running the following script generates 10 checkpoints, each fine-tuned on 10% of the data sampled from the `memc` training split.
cd fine_tuning
python run.py
All of the checkpoints will be saved in `<base>/pretrained/<model_name>/checkpoints`.
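One way the 10 checkpoints can arise is from 10 independent 10% samples of the training split, each with its own seed; this is a sketch of that idea, not the repository's exact sampling code:

```python
import random

def sample_fraction(data, fraction, seed):
    """Draw a reproducible random subset covering `fraction` of the data."""
    rng = random.Random(seed)
    k = max(1, int(len(data) * fraction))
    return rng.sample(data, k)

train = list(range(100))  # stand-in for the memc training split
subsets = [sample_fraction(train, 0.1, seed) for seed in range(10)]

print(len(subsets))                         # 10 samples -> 10 checkpoints
print(all(len(s) == 10 for s in subsets))   # each covers 10% of the data
```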
Make sure you have already obtained the fine-tuned checkpoints stored in `<base>/pretrained/<model_name>/checkpoints` before running the transfer learning experiments.
cd transfer_learning
# on aggregate of 12 categories
python run_tl_agg.py
# on single category
python run_tl_single.py
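The difference between the two scripts is the evaluation scope: the aggregate run pools all categories, while the single-category run scores each one separately. A hedged sketch of aggregate scoring via micro-averaged F1 (the metric choice, category names, and counts below are illustrative assumptions, not taken from the repository):

```python
def aggregate_f1(per_category):
    """Micro-average: pool true/false positives and false negatives, then F1."""
    tp = sum(c["tp"] for c in per_category.values())
    fp = sum(c["fp"] for c in per_category.values())
    fn = sum(c["fn"] for c in per_category.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up per-category counts for illustration only.
categories = {
    "memc": {"tp": 8, "fp": 2, "fn": 2},
    "other": {"tp": 6, "fp": 4, "fn": 2},
}
print(round(aggregate_f1(categories), 3))  # 0.737
```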