VarBERT

VarBERT is a BERT-based model that predicts meaningful variable names and variable origins in decompiled code. Through transfer learning, it supports software reverse engineering tasks: VarBERT is pre-trained on 5M human-written source code functions and then fine-tuned on decompiled code from IDA and Ghidra across four compiler optimization levels (O0, O1, O2, O3). We built two data sets: (a) the Human Source Code data set (HSC) and (b) VarCorpus (for IDA and Ghidra). This work was developed for the IEEE S&P 2024 paper "Len or index or count, anything but v1": Predicting Variable Names in Decompilation Output with Transfer Learning.

Key Features

Overview

This repository explains how to generate a new data set, train a new model, and run inference with the existing VarBERT models from the paper. To use VarBERT models in your day-to-day reverse engineering tasks, please refer to Use VarBERT.

VarBERT Model

VarBERT takes its inspiration from transfer learning in general and from Bidirectional Encoder Representations from Transformers (BERT) in particular.

Use VarBERT

For a step-by-step guide and a demo on how to get started with the VarBERT API, please visit VarBERT API.

Training and Inference

For training a new model or running inference on existing models, see our detailed guide at Training VarBERT.

Models available for download:

(A README containing all the necessary links for the model is also available.)

Data sets

The data sets come in two splits: (a) Function Split and (b) Binary Split.

Data sets available at:

Fine-tuned models and their corresponding data sets follow a shared naming convention. For example, the model IDA-O0-Function and the data set IDA-O0 are both based on functions decompiled from O0 binaries using the IDA decompiler.
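The naming convention can be sketched in a few lines of Python. This is only an illustration: the README confirms the names IDA-O0-Function and IDA-O0, while the `Ghidra` prefix, the `Binary` suffix, and the reading of `Function` as the split name are assumptions. The helper names `dataset_name` and `model_name` are hypothetical and not part of the repository.

```python
from itertools import product

# Only "IDA", "O0" and "Function" appear in confirmed names; the other
# values below are assumed to follow the same pattern.
DECOMPILERS = ("IDA", "Ghidra")
OPT_LEVELS = ("O0", "O1", "O2", "O3")
SPLITS = ("Function", "Binary")


def dataset_name(decompiler: str, opt_level: str) -> str:
    """Build a data set name such as 'IDA-O0'."""
    return f"{decompiler}-{opt_level}"


def model_name(decompiler: str, opt_level: str, split: str) -> str:
    """Build a model name such as 'IDA-O0-Function'."""
    return f"{dataset_name(decompiler, opt_level)}-{split}"


# Enumerate every model name the assumed pattern would produce.
ALL_MODEL_NAMES = [
    model_name(d, o, s) for d, o, s in product(DECOMPILERS, OPT_LEVELS, SPLITS)
]
```

Check any generated name against the published download links before relying on it.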

Note: Our existing data sets were generated using IDA Pro 7.6 and Ghidra 10.4.

Installation

Prerequisites for training a model or generating a data set

Linux with Python 3.8 or higher
torch ≥ 1.9.0
transformers ≥ 4.10.0

pip install -r requirements.txt
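After installing, you may want to confirm that the environment meets the minimum versions listed above. The following is a minimal sketch, not a script shipped with this repository; the function names are hypothetical. The version comparison is deliberately simple: it looks only at the leading numeric components, so a pre-release such as `4.10.0rc1` is treated like `4.10.0`.

```python
import re
import sys

# Minimum versions taken from the prerequisites list above.
MINIMUMS = {"torch": "1.9.0", "transformers": "4.10.0"}


def parse_version(text):
    """Return the leading numeric components, e.g. '1.9.0+cu111' -> (1, 9, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"cannot parse version string: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, required):
    """Compare two version strings after padding them to equal length."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_prerequisites():
    """Return a list of human-readable problems; an empty list means OK."""
    if sys.version_info < (3, 8):
        return ["Python 3.8 or higher is required"]
    # importlib.metadata is available from Python 3.8 onwards.
    from importlib import metadata

    problems = []
    for package, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed (need >= {minimum})")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{package} {installed} is older than {minimum}")
    return problems
```

Calling `check_prerequisites()` and printing each returned string gives a quick report; an empty list means the listed minimums are satisfied.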

# joern requires Java 11
sudo apt-get install openjdk-11-jdk

# Ghidra 10.4 requires Java 17+
sudo apt-get install openjdk-17-jdk

git clone git@github.com:rhelmot/dwarfwrite.git
cd dwarfwrite
pip install .

Note: Ensure you install the correct Java version required by your specific Ghidra version.
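Since joern needs Java 11 and Ghidra 10.4 needs Java 17 or newer, it can help to check which Java is currently first on your `PATH`. The sketch below is an illustration, not part of this repository, and both function names are hypothetical. It assumes the usual `java -version` banner format, in which the version appears in double quotes after the word `version`.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Handles both the modern scheme ("17.0.9" -> 17) and the legacy
    scheme ("1.8.0_292" -> 8). Returns None if nothing matches.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    if major == 1 and match.group(2) is not None:
        # Legacy numbering: 1.8 means Java 8.
        return int(match.group(2))
    return major


def installed_java_major():
    """Run `java -version` and return the major version, or None if unavailable."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return None
    # The banner is normally written to stderr; stdout is included to be safe.
    return parse_java_major(result.stderr + result.stdout)
```

If both JDKs are installed, the value returned by `installed_java_major()` shows only the default one, so you may still need to point each tool at the right JDK.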

Citing

TODO