
jTrans

This repo is the official code of jTrans: Jump-Aware Transformer for Binary Code Similarity Detection.

Figure: Illustrating the performance of the proposed jTrans

News

Writeups

Get Started

Prerequisites

Quick Start

a. Create a conda virtual environment and activate it.

conda create -n jtrans python=3.8 pandas tqdm -y
conda activate jtrans

b. Install PyTorch and other packages.

conda install pytorch cudatoolkit=11.0 -c pytorch
python -m pip install simpletransformers networkx pyelftools
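
Optionally, verify that the install picked up CUDA before moving on (a quick sanity check, not required by the repo):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"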

c. Get the jTrans code and models.

git clone https://github.com/vul337/jTrans.git && cd jTrans

Download experiments.tar.gz and models.tar.gz, then extract them.

tar -xzvf experiments.tar.gz && tar -xzvf models.tar.gz

d. Get the BinaryCorp dataset. Download the processed dataset from this link.

e. Fine-tune new models on BinaryCorp

python finetune.py -h
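
finetune.py lists its command-line options via -h. A typical launch then pins the run to a single GPU with the standard PyTorch environment variable; the bracketed part is a placeholder for whatever options -h reports, not actual flag names from the script:

CUDA_VISIBLE_DEVICES=0 python finetune.py [options listed by -h]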

f. Evaluation

python eval_save.py -h
python fasteval.py -h

To evaluate jTrans on BinaryCorp-3M after extracting experiments.tar.gz, run:

python fasteval.py
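
The paper's evaluation uses pool-based retrieval metrics (MRR and Recall@1). As a rough illustration of how MRR is computed from a query-candidate similarity matrix (generic code, not the script's actual implementation):

import numpy as np

def mrr(sim_matrix: np.ndarray) -> float:
    # sim_matrix[i, j] = similarity between query function i and candidate j;
    # the ground-truth match for query i is assumed to be candidate i
    # (the same source function compiled under a different setting).
    reciprocal_ranks = []
    for i, row in enumerate(sim_matrix):
        rank = 1 + np.sum(row > row[i])  # rank of the true match (1 = best)
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))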

g. Try jTrans on your own binaries

Make sure you have IDA Pro 7.5+ and follow the instructions at datautils. After extracting features from your binaries, you can run jTrans on them, following the usage in eval_save.py; a minimal sketch is given below.
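
For instance, one might embed two preprocessed functions and compare them like this (the model path, the plain Hugging Face BertModel load, and the token strings are assumptions for illustration only; eval_save.py shows the repo's actual loading and preprocessing code):

import torch
from transformers import BertModel, BertTokenizerFast

MODEL_DIR = "./models/jTrans-finetune"  # hypothetical path inside the extracted models.tar.gz
tokenizer = BertTokenizerFast.from_pretrained(MODEL_DIR)
model = BertModel.from_pretrained(MODEL_DIR).eval()

def embed(func_tokens):
    # Tokenize the preprocessed instruction sequence (at most 512 tokens)
    # and use the pooled output as the function embedding.
    inputs = tokenizer(func_tokens, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return model(**inputs).pooler_output

# func_a / func_b stand in for token sequences produced by the datautils IDA scripts.
func_a = "push rbp mov rbp rsp"
func_b = "push rbp sub rsp 0x20"
sim = torch.nn.functional.cosine_similarity(embed(func_a), embed(func_b))
print(f"cosine similarity: {sim.item():.4f}")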

Dataset

Acknowledgement

This project would not have been possible without several great open-source code bases. We list some notable examples below.

Bibtex

If this work or the BinaryCorp dataset is helpful for your research, please consider citing the following BibTeX entries.

@inproceedings{10.1145/3533767.3534367,
author = {Wang, Hao and Qu, Wenjie and Katz, Gilad and Zhu, Wenyu and Gao, Zeyu and Qiu, Han and Zhuge, Jianwei and Zhang, Chao},
title = {JTrans: Jump-Aware Transformer for Binary Code Similarity Detection},
year = {2022},
isbn = {9781450393799},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3533767.3534367},
doi = {10.1145/3533767.3534367},
abstract = {Binary code similarity detection (BCSD) has important applications in various fields such as vulnerabilities detection, software component analysis, and reverse engineering. Recent studies have shown that deep neural networks (DNNs) can comprehend instructions or control-flow graphs (CFG) of binary code and support BCSD. In this study, we propose a novel Transformer-based approach, namely jTrans, to learn representations of binary code. It is the first solution that embeds control flow information of binary code into Transformer-based language models, by using a novel jump-aware representation of the analyzed binaries and a newly-designed pre-training task. Additionally, we release to the community a newly-created large dataset of binaries, BinaryCorp, which is the most diverse to date. Evaluation results show that jTrans outperforms state-of-the-art (SOTA) approaches on this more challenging dataset by 30.5% (i.e., from 32.0% to 62.5%). In a real-world task of known vulnerability searching, jTrans achieves a recall that is 2X higher than existing SOTA baselines.},
booktitle = {Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis},
pages = {1–13},
numpages = {13},
keywords = {Binary Analysis, Similarity Detection, Neural Networks, Datasets},
location = {Virtual, South Korea},
series = {ISSTA 2022}
}

@article{wang2022jtrans,
  title={jTrans: Jump-Aware Transformer for Binary Code Similarity},
  author={Wang, Hao and Qu, Wenjie and Katz, Gilad and Zhu, Wenyu and Gao, Zeyu and Qiu, Han and Zhuge, Jianwei and Zhang, Chao},
  journal={arXiv preprint arXiv:2205.12713},
  year={2022}
}