opensciences / opensciences.github.io

Website for OpenScience -
http://openscience.us
MIT License

Improving the Accuracy of Duplicate Bug Report Detection using Textual Similarity Measures #160

Closed bigfatnoob closed 9 years ago

bigfatnoob commented 9 years ago

By: Alina Lazar, Sarah Ritchey, Bonita Sharif Department of Computer Science and Information Systems Youngstown State University Youngstown, Ohio USA 44555 alazar@ysu.edu, sritchey@student.ysu.edu, bsharif@ysu.edu

Paper Src: http://dl.acm.org/citation.cfm?id=2597088

Data Src: 3 bug report datasets: 1) Eclipse 2008, 2) Open Office, 3) Mozilla

bigfatnoob commented 9 years ago

Data Src Link: http://www.csis.ysu.edu/~alazar/msr14/

Author emails are available in the previous comment. I have documentation on the paper and can provide it on request.

bigfatnoob commented 9 years ago

Paper: Improving the Accuracy of Duplicate Bug Report Detection using Textual Similarity Measures

Authors: Alina Lazar, Sarah Ritchey, Bonita Sharif Department of Computer Science and Information Systems Youngstown State University Youngstown, Ohio USA 44555 alazar@ysu.edu, sritchey@student.ysu.edu, bsharif@ysu.edu

Dataset Acquisition: The three systems used in the case study presented here are Eclipse, Open Office, and Mozilla. Using web scraping techniques, bug reports were collected from the Bugzilla websites of the three systems. In Table 1 we report: the time interval during which the collected bugs were submitted, the number of initial bugs collected, the number of initial duplicates, the number of bugs after preprocessing, and the number of duplicate pairs used for training.

After the bug reports were collected, reports that had an open resolution were removed from the datasets, because their status cannot be confirmed from the available information; removing them helps prevent training the model with mislabeled data. It is also important to note that this could easily be done in an industrial setting. The datasets were trimmed further by removing duplicate-resolution bugs that did not have masters in the dataset. For each dataset, the master duplicate bug was found together with all the duplicate bugs in its group, as a preliminary step before generating all the duplicate bug report pairs. Each such pair is labeled by adding the classification feature decision to the dataset and setting its value to 1. Four times as many non-duplicate pairs of bugs as duplicate pairs were generated for the training dataset; the decision for the non-duplicate pairs is set to -1.
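
A minimal sketch of this pair-generation step, assuming each preprocessed report is a dict and each duplicate bug carries a `master_id` field pointing to its master (the field names and dict representation are assumptions for illustration; the paper does not publish its pipeline code):

```python
import random

def build_pairs(bugs, neg_ratio=4, seed=42):
    """Label duplicate pairs with decision=1 and sample neg_ratio times
    as many random non-duplicate pairs with decision=-1."""
    by_id = {b["bug_id"]: b for b in bugs}
    # Duplicate pairs: each duplicate bug paired with its master.
    dup_pairs = [(by_id[b["master_id"]], b, 1)
                 for b in bugs if b.get("master_id") in by_id]
    dup_keys = {(m["bug_id"], d["bug_id"]) for m, d, _ in dup_pairs}
    # Non-duplicate pairs: random pairs not known to be duplicates.
    rng = random.Random(seed)
    non_dup = []
    while len(non_dup) < neg_ratio * len(dup_pairs):
        b1, b2 = rng.sample(bugs, 2)
        key = (b1["bug_id"], b2["bug_id"])
        if key not in dup_keys and key[::-1] not in dup_keys:
            non_dup.append((b1, b2, -1))
    return dup_pairs + non_dup
```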

The unprocessed data can be referenced from:

“C. Sun, D. Lo, S. Khoo, and J. Jiang. Towards more accurate retrieval of duplicate bug reports. In Proc. of the 26th IEEE/ACM ASE, pages 253–262, 2011.”

Classification features selected (not in reports): bug id, title, description, product, component, type, priority, version, open date.

18 synthesized features are generated using the TakeLab simple system. A few of them, in no particular order:

1-3) n-gram word overlap for unigrams, bigrams, and trigrams
4-6) n-gram word overlap for unigrams, bigrams, and trigrams after lemmatization
7) WordNet-based augmented word overlap
8) weighted word overlap
9-10) normalized differences for sentence length and aggregate word information content
11) shallow named entity and numbers overlap

FYI: Neither the paper nor the dataset gives a detailed description of the attributes or their order in the dataset, but the features are referenced from “http://www.aclweb.org/anthology/S12-1060”
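
For illustration, here is a minimal sketch of features 1-3 (n-gram word overlap), following the harmonic-mean overlap definition from the TakeLab paper linked above; tokenization, lemmatization, and the remaining features are left out, and the toy titles are made up:

```python
def ngrams(tokens, n):
    """Return the set of n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(tokens1, tokens2, n):
    """Harmonic mean of how well each sentence's n-grams cover the
    other's, as defined in the TakeLab system (S12-1060)."""
    s1, s2 = ngrams(tokens1, n), ngrams(tokens2, n)
    shared = len(s1 & s2)
    if shared == 0:
        return 0.0
    return 2.0 / (len(s1) / shared + len(s2) / shared)

# Features 1-3: unigram, bigram, and trigram overlap of two bug titles.
t1 = "crash when opening large file".split()
t2 = "crash opening a large file".split()
features = [ngram_overlap(t1, t2, n) for n in (1, 2, 3)]
```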

Features 19-25 are calculated as defined in the paper “Towards more accurate retrieval of duplicate bug reports”:

feature19 = 1 if b1.prod = b2.prod, 0 otherwise
feature20 = 1 if b1.comp = b2.comp, 0 otherwise
feature21 = 1 if b1.type = b2.type, 0 otherwise
feature22 = 1/(1 + |b1.priority − b2.priority|)
feature23 = 1/(1 + |b1.version − b2.version|)
feature24 = |b1.open date − b2.open date|
feature25 = |b1.bug id − b2.bug id|
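
A minimal sketch of these seven features in Python, assuming each bug report is a dict with numeric `priority`, `version`, `open_date` (e.g. a Unix timestamp), and `bug_id` fields (a representation chosen here for illustration, not taken from the paper):

```python
def pair_features(b1, b2):
    """Compute features 19-25 for a pair of bug reports b1, b2 (dicts)."""
    return [
        1 if b1["prod"] == b2["prod"] else 0,               # feature19
        1 if b1["comp"] == b2["comp"] else 0,               # feature20
        1 if b1["type"] == b2["type"] else 0,               # feature21
        1.0 / (1 + abs(b1["priority"] - b2["priority"])),   # feature22
        1.0 / (1 + abs(b1["version"] - b2["version"])),     # feature23
        abs(b1["open_date"] - b2["open_date"]),             # feature24
        abs(b1["bug_id"] - b2["bug_id"]),                   # feature25
    ]
```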

Train and Test Split of the Dataset.

After we generate the features for all the bugs in the datasets, we divide the initial dataset into a training set and a testing set. The training set contains 5,000 bug pairs; all other pairs are put into the testing set. The instances are divided into the subsets using stratified sampling, so the ratio between the two classes is preserved: of the 5,000 pairs, 1,000 pairs are duplicate and 4,000 are non-duplicate. Next, all values in the training and testing datasets are scaled to the [-1, 1] interval. The training set is scaled first, and the resulting ranges are then applied to scale the testing set. The training sets are used to generate the classification models.
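
A sketch of this split-and-scale protocol using scikit-learn, which mirrors the procedure described (the paper does not say which tools were used, and the data here is a synthetic placeholder): a stratified split of 5,000 training pairs, then [-1, 1] scaling fit on the training set only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Placeholder data: 25 features per pair, decision 1 (duplicate) / -1 (non-duplicate).
rng = np.random.default_rng(42)
X = rng.random((25000, 25))
y = np.where(rng.random(25000) < 0.2, 1, -1)

# Stratified split: 5,000 training pairs, class ratio preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=5000, stratify=y, random_state=42)

# Fit the [-1, 1] scaling on the training set, then apply the same
# ranges to the testing set.
scaler = MinMaxScaler(feature_range=(-1, 1))
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```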

CarterPape commented 9 years ago

Added as bugsal

http://openscience.us/repo/issues/bugsal.html

larissaalmeida6 commented 6 years ago

I guess this Data Src link is not available anymore: http://www.csis.ysu.edu/~alazar/msr14/

Can anyone help me?