PdfRep Dataset

Overview

The PdfRep dataset is a comprehensive collection of PDF files, compiled from several reliable sources to support research in areas such as malware analysis, document classification, and cybersecurity. It is intended for research purposes. If you use this dataset, please cite our work:

R. Liu, R. Joyce, C. Matuszek and C. Nicholas, "Evaluating Representativeness in PDF Malware Datasets: A Comparative Study and a New Dataset," 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 2023, pp. 3017-3024, doi: 10.1109/BigData59044.2023.10386516.

Data Sources

The PdfRep dataset is an amalgamation of files from four distinct sources:

Contagio Dataset: A well-known repository of malware samples. Accessible at Contagio Blogspot.
CIC Dataset: A dataset that includes a variety of malicious PDF files. Available for download on the CIC Dataset page.
VirusShare: A collection of malicious files. In our experience, including this collection can significantly improve the performance of trained models. Available at VirusShare Data.
Govdocs: A dataset of benign files, hosted by Digital Corpora. Available at Digital Corpora.
Feature File: The extracted features can be downloaded from Feature Data.

Dataset Structure

The dataset includes a mix of benign and malicious PDF files, providing a diverse range of samples for analysis.

File References

For easy navigation and reference, consult the filename column in feature_file.csv. This column lists the specific files included in the PdfRep dataset, so individual files can be identified and located directly. The same file also contains the corresponding features used in this research.
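
As an illustration of the lookup described above, here is a minimal sketch that reads the filename column with Python's standard csv module. It assumes feature_file.csv is a comma-separated file with a header row. The function names load_filenames and index_features are hypothetical helpers introduced for this example and are not part of the dataset or its tooling.

```python
import csv


def load_filenames(feature_csv_path, column="filename"):
    """Return every value in the given column of the feature file, in file order."""
    with open(feature_csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Fail early if the header is missing or lacks the expected column.
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {feature_csv_path}")
        return [row[column] for row in reader]


def index_features(feature_csv_path, column="filename"):
    """Map each filename to its full row of features (values are kept as strings)."""
    with open(feature_csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {feature_csv_path}")
        # If a filename appears more than once, the last row wins.
        return {row[column]: row for row in reader}
```

Calling load_filenames("feature_file.csv") would return the list of filenames, and index_features("feature_file.csv") would give the feature row for any one of them. Feature values are returned as strings, so convert them to numeric types as needed.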

Usage

This dataset is intended for use in academic and research settings. If you encounter an error while extracting features with the pdfrw library, try this modified version instead: https://github.com/mzweilin/PDF-Malware-Parser

Acknowledgments

We acknowledge the contributions of the respective organizations and repositories that have made their data available, aiding in the creation of this comprehensive dataset.

Contact

For any inquiries or further information regarding the PdfRep dataset, please feel free to contact us.
