rahulpandita opened 9 years ago
The authors say in the introduction that they will make available "a novel, hand-curated dataset of vulnerabilities in PHP web applications that will be offered to the community." Is this the sort of thing we care about? Or do we care about the input dataset together with the vulnerability and defect metrics? Perhaps both? Is the former too qualitative to be useful to us?
They make available "a raw dataset and a replication dataset. Most users will be interested in the raw dataset, which contains the complete vulnerability and feature data for each application version. A subset of the raw dataset was extracted and reformatted into a replication dataset, which contains the data and scripts required to replicate the results of the study." Which one do we take? Both? Does that depend on size, or on something else?
Should we copy the source code of every version of PHPMyAdmin and Moodle to the big repo? They add up to 519 MB in total, while all the other relevant data comes to less than 30 MB.
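For what it's worth, here's a minimal Python sketch to double-check those size figures locally before we decide. The directory names are placeholders for wherever the downloaded source trees end up, not actual paths from the dataset:

```python
import os

def dir_size_mb(path):
    """Sum the sizes of all regular files under `path`, in megabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)

# Hypothetical local directory names; adjust to the real layout.
for app in ("phpmyadmin", "moodle"):
    print(f"{app}: {dir_size_mb(app):.1f} MB")
```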
Am I reading this right?
If yes and yes, then download the raw dataset and leave the replication dataset; a rough sketch of that is below.
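Something like this could mirror just the raw dataset files. It's only a sketch under assumptions I haven't verified against the actual page: that the data page at the URL below is a plain index of links, and that the replication package and any application source archives are recognizable by name (the `SKIP` filters are guesses, not confirmed file names):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

BASE = "http://seam.cs.umd.edu/webvuldata/"
SKIP = ("replication", "src", "source")  # guessed name fragments to exclude

class LinkCollector(HTMLParser):
    """Collect href targets from every <a> tag on the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)

def raw_dataset_urls():
    """Return absolute URLs for everything that looks like raw data."""
    parser = LinkCollector()
    with urlopen(BASE) as page:
        parser.feed(page.read().decode("utf-8", errors="replace"))
    return [urljoin(BASE, href) for href in parser.links
            if not any(s in href.lower() for s in SKIP)]

if __name__ == "__main__":
    for url in raw_dataset_urls():
        print("would fetch:", url)
        # urlretrieve(url, url.rsplit("/", 1)[-1])  # uncomment to download
```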
Predicting Vulnerable Components: Software Metrics vs Text Mining
Link to paper: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6982351&queryText%3DPredicting+Vulnerable+Components%3A+Software+Metrics+vs+Text+Mining
Link to data: http://seam.cs.umd.edu/webvuldata/
James Walden, Jeffrey Stuckman, and Riccardo Scandariato. Pages 23-33. doi: 10.1109/ISSRE.2014.32

From the abstract: "In this paper, we provide a high-quality, public dataset, containing 223 vulnerabilities found in three web applications, to help address this issue. We used this dataset to compare vulnerability prediction models based on text mining with models using software metrics as predictors."