rahulpandita opened 9 years ago
The authors say in the introduction that they will make available "a novel, hand-curated dataset of vulnerabilities in PHP web applications that will be offered to the community." Is this the sort of thing we care about? Or do we care about the input dataset together with the vulnerability and defect metrics? Perhaps both? Is the former too qualitative to be useful to us?
They make available "a raw dataset and a replication dataset. Most users will be interested in the raw dataset, which contains the complete vulnerability and feature data for each application version. A subset of the raw dataset was extracted and reformatted into a replication dataset, which contains the data and scripts required to replicate the results of the study." Which one do we take? Both? Does that depend on size, or on something else?
Should we copy the source code of every version of PHPMyAdmin and Moodle to the big repo? They add up to 519 MB in total, while all the other relevant data comes to less than 30 MB.
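For what it's worth, here's a minimal Python sketch to double-check those size figures locally before we decide. The directory names are placeholders for wherever the downloaded source trees end up, not actual paths from the dataset:

```python
import os

def dir_size_mb(path):
    """Sum the sizes of all regular files under `path`, in megabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)

# Hypothetical local directory names; adjust to the real layout.
for app in ("phpmyadmin", "moodle"):
    print(f"{app}: {dir_size_mb(app):.1f} MB")
```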
Am I reading this right?
If yes and yes, then download the raw dataset and leave the replication dataset; a rough sketch of that is below.
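Something like this could mirror just the raw dataset files. It's only a sketch under assumptions I haven't verified against the actual page: that the data page at the URL below is a plain index of links, and that the replication package and any application source archives are recognizable by name (the `SKIP` filters are guesses, not confirmed file names):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

BASE = "http://seam.cs.umd.edu/webvuldata/"
SKIP = ("replication", "src", "source")  # guessed name fragments to exclude

class LinkCollector(HTMLParser):
    """Collect href targets from every <a> tag on the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)

def raw_dataset_urls():
    """Return absolute URLs for everything that looks like raw data."""
    parser = LinkCollector()
    with urlopen(BASE) as page:
        parser.feed(page.read().decode("utf-8", errors="replace"))
    return [urljoin(BASE, href) for href in parser.links
            if not any(s in href.lower() for s in SKIP)]

if __name__ == "__main__":
    for url in raw_dataset_urls():
        print("would fetch:", url)
        # urlretrieve(url, url.rsplit("/", 1)[-1])  # uncomment to download
```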
Predicting Vulnerable Components: Software Metrics vs Text Mining
Link to paper: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6982351&queryText%3DPredicting+Vulnerable+Components%3A+Software+Metrics+vs+Text+Mining
Link to data: http://seam.cs.umd.edu/webvuldata/
James Walden, Jeffrey Stuckman, and Riccardo Scandariato. Pages 23-33. doi: 10.1109/ISSRE.2014.32

From the abstract: "In this paper, we provide a high-quality, public dataset, containing 223 vulnerabilities found in three web applications, to help address this issue. We used this dataset to compare vulnerability prediction models based on text mining with models using software metrics as predictors."