guanqun-yang / FewVulnerability

Detecting software vulnerabilities described in public security reports with minimal training examples

FewVulnerability

Introduction

This is the repository for the paper "Few-Sample Named Entity Recognition for Security Vulnerability Reports by Fine-Tuning Pre-Trained Language Models". In this paper, we design a system that leverages pretrained language models (PLMs) to identify vulnerable software names and versions in public vulnerability reports.

Given a vulnerability report, our system tags the relevant tokens as SN (vulnerable software name) or SV (vulnerable software version) and all other tokens as O (outside).
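The tagging scheme can be illustrated with a minimal sketch. The sentence, the software name "Example Server", and the helper extract_entities are all hypothetical and made up for illustration; they are not taken from the dataset or from the repository's code. The sketch only shows how per-token SN/SV/O tags map back to software names and versions.

```python
from typing import List, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group consecutive tokens that share the same non-O tag into (label, text) spans."""
    entities: List[Tuple[str, str]] = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # Extend the running span while the same entity tag continues.
        if tag == current_label and tag != "O":
            current_tokens.append(token)
            continue
        # The tag changed: flush the previous span if it was an entity.
        if current_label not in (None, "O"):
            entities.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = tag, [token]
    # Flush a span that runs to the end of the sentence.
    if current_label not in (None, "O"):
        entities.append((current_label, " ".join(current_tokens)))
    return entities


# Hypothetical report sentence and tags (illustration only).
tokens = ["Buffer", "overflow", "in", "Example", "Server", "before",
          "2.4.1", "allows", "remote", "code", "execution", "."]
tags = ["O", "O", "O", "SN", "SN", "O",
        "SV", "O", "O", "O", "O", "O"]

print(extract_entities(tokens, tags))
```

Under these assumptions, the two SN tokens are merged into one software name and the single SV token is reported as the version.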

Environment

Follow the steps below to set up an environment in which the code runs.

Data

The data used for training is downloaded from here. It is also provided in our repository (see dataset/ner_data), so no additional preparation is required to run our experiments.

Global Variables

All global variables are defined in setting/setting.py. The following variables must be set correctly before running the experiments.

Fine-Tuning Experiment

The transfer learning experiments depend on the fine-tuning experiments. Make sure to run the following steps before the transfer learning experiments.

Transfer Learning Experiment

Before running the transfer learning experiments, make sure you have obtained the fine-tuned checkpoints, which are stored in <base>/pretrained/<model_name>/checkpoints.

cd transfer_learning
# on aggregate of 12 categories
python run_tl_agg.py
# on single category
python run_tl_single.py
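Before launching the scripts above, the checkpoint prerequisite can be verified with a short check. This is a minimal sketch, not part of the repository: the helpers checkpoint_dir and has_checkpoints are hypothetical, and the values passed for base and model_name are placeholders that should match your own settings. Because the checkpoint file format is not stated here, the check only confirms that the directory exists and is not empty.

```python
from pathlib import Path


def checkpoint_dir(base: str, model_name: str) -> Path:
    """Build <base>/pretrained/<model_name>/checkpoints as described above."""
    return Path(base) / "pretrained" / model_name / "checkpoints"


def has_checkpoints(base: str, model_name: str) -> bool:
    """Return True if the checkpoint directory exists and contains at least one entry."""
    directory = checkpoint_dir(base, model_name)
    return directory.is_dir() and any(directory.iterdir())


if __name__ == "__main__":
    # Placeholder values (assumptions); replace them with your own base path and model name.
    base, model_name = ".", "my-model"
    if has_checkpoints(base, model_name):
        print("Fine-tuned checkpoints found.")
    else:
        print(f"No checkpoints in {checkpoint_dir(base, model_name)}; "
              "run the fine-tuning experiments first.")
```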