Defect Prediction between Software Versions with Active Learning and Dimensionality Reduction

vivekaxl commented 9 years ago

Paper: Defect Prediction between Software Versions with Active Learning and Dimensionality Reduction

Data: https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/

Contact information: Dr. Andreas Zeller, zeller@cs.uni-saarland.de

CarterPape commented 9 years ago

Data download link: https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/promise-2_0a-xml.zip

Context notes:

What is the data format?

The provided XML files contain the defect data collected from the Eclipse bug database and version archive, separated by Eclipse version. The coarse structure of the XML files is described in the companion paper "If Your Bug Database Could Talk...":

Packages are structured hierarchically: each subpackage is a node whose parent node in the tree is its super package. The following example shows the tree structure:

            <package name="org.eclipse">
              <counts> ... </counts>
              <compilationunit ...> ... </compilationunit>
              <package name="org.eclipse.core">
                ...
              </package>
            </package>

Here's an extract from the file eclipse-defects-version-2.0.xml:

          <?xml version="1.0" encoding="UTF-8"?>
          <!-- comments -->
          <defects project="eclipse" release="2.0" dataversion="1.0">
          <plug-in name="platform-launcher">
            <counts>
              <count id="pre" value="0" avg="0.0" compilationunits="1" max="0"/>
              <count id="post" value="0" avg="0.0" compilationunits="1" max="0"/>
            </counts>
            <package name="org.eclipse">
              <counts>
                <count id="pre" value="0" avg="0.0" compilationunits="1" max="0"/>
                <count id="post" value="0" avg="0.0" compilationunits="1" max="0"/>
              </counts>
              <package name="org.eclipse.core">
                <counts>
                  <count id="pre" value="0" avg="0.0" compilationunits="1" max="0"/>
                  <count id="post" value="0" avg="0.0" compilationunits="1" max="0"/>
                </counts>
                <package name="org.eclipse.core.launcher">
                  <counts>
                    <count id="pre" value="0" avg="0.0" compilationunits="1" max="0"/>
                    <count id="post" value="0" avg="0.0" compilationunits="1" max="0"/>
                  </counts>
                  <compilationunit dir="/platform-launcher/library/" base="Main.java">
                    <counts>
                      <count id="pre" value="0"/>
                      <count id="post" value="0"/>
                    </counts>
                  </compilationunit>
                </package>
              </package>
            </package>
          </plug-in>

Defect counts are listed as count elements at the plug-in, package and compilationunit levels. The value attribute holds the actual number of pre-release ("pre") or post-release ("post") defects. The average ("avg") and maximum ("max") values refer to the defects found in the compilation units ("compilationunits") of that plug-in or package; the average is the number of defects per compilation unit. Each compilation unit is listed separately as a compilationunit element within its enclosing package.
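To make the description above concrete, here is a minimal Python sketch that reads the pre- and post-release counts for each compilation unit. It assumes the element and attribute names shown in the extract; the helper names `read_counts` and `compilation_unit_defects` are made up for illustration, and the embedded sample is a trimmed copy of the extract rather than the real file.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modelled on the extract above (assumption: the real
# files follow the same element and attribute names).
SAMPLE = """<defects project="eclipse" release="2.0" dataversion="1.0">
  <plug-in name="platform-launcher">
    <package name="org.eclipse.core.launcher">
      <counts>
        <count id="pre" value="0" avg="0.0" compilationunits="1" max="0"/>
        <count id="post" value="0" avg="0.0" compilationunits="1" max="0"/>
      </counts>
      <compilationunit dir="/platform-launcher/library/" base="Main.java">
        <counts>
          <count id="pre" value="0"/>
          <count id="post" value="0"/>
        </counts>
      </compilationunit>
    </package>
  </plug-in>
</defects>"""


def read_counts(element):
    """Return {'pre': n, 'post': m} from the element's own <counts> child."""
    counts = element.find("counts")  # direct child only, not nested ones
    if counts is None:
        return {}
    return {c.get("id"): int(c.get("value")) for c in counts.findall("count")}


def compilation_unit_defects(root):
    """Yield (path, pre, post) for every <compilationunit> at any depth."""
    for unit in root.iter("compilationunit"):
        counts = read_counts(unit)
        path = unit.get("dir", "") + unit.get("base", "")
        yield path, counts.get("pre", 0), counts.get("post", 0)


root = ET.fromstring(SAMPLE)
rows = list(compilation_unit_defects(root))
print(rows)
```

For a downloaded file, `ET.parse("eclipse-defects-version-2.0.xml").getroot()` would replace the `ET.fromstring(SAMPLE)` call. Using `iter` flattens the package hierarchy, which is enough for file-level counts; package-level analysis would need to walk the nested package elements instead.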

vivekaxl commented 9 years ago

1. Data source: Defect Prediction between Software Versions with Active Learning and Dimensionality Reduction

2. Link to the material associated with the dataset (if available): https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/promise-2_0a-xml.zip

3. Attribution list for the material (include at least one contact email): Huihua Lu (hlu3@mix.wvu.edu), Ekrem Kocaguneli (kocaguneli@gmail.com) and Bojan Cukic (bojan.cukic@mail.wvu.edu)

Though this data was used in this paper, it should be downloaded from the Saarland University website. BTW, this is publicly available data, so we shouldn't need to email anyone.

4. BibTeX reference

            @article{ludefect,
              title={Defect Prediction between Software Versions with Active Learning and Dimensionality Reduction},
              author={Lu, Huihua and Kocaguneli, Ekrem and Cukic, Bojan}
            }

5. Link to the datasets: https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/promise-2_0a-xml.zip (~15 MB after uncompressing)

6. PROMISE repo category (effort, requirements, model, defect, etc.) Defect

7. General overview of the data

| Release | Packages / files | % with defects | Metrics |
|---|---|---|---|
| 2.0 | 377 / 6729 | 50.4% / 14.5% | 41 / 32 |
| 2.1 | 434 / 7888 | 44.7% / 10.8% | 41 / 32 |
| 3.0 | 661 / 10593 | 47.4% / 14.8% | 41 / 32 |

Table I describes the defect content in three successive releases of Eclipse, 2.0, 2.1 and 3.0, at two levels of granularity: files and packages. The data set has been aggregated from the release archives and a bug repository. The release archives, CVS and more recently Git, provide entries related to the commit history of a system. In Bugzilla, one can map each bug report to a release. For classification purposes, we divide packages/files into two classes: those with and without defects. The complexity metrics for each package/file can be computed from the archived builds of Eclipse. We utilized the same complexity metrics used in [15].
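The two-class split described above can be sketched as follows. This assumes that "with defects" means a post-release defect count greater than zero, which the thread does not state outright; the function names and example counts are hypothetical and not taken from the dataset.

```python
def label_defective(post_counts):
    """Map {name: post-release defect count} to {name: True/False}.

    Assumption: an item counts as defective when its count is > 0.
    """
    return {name: count > 0 for name, count in post_counts.items()}


def percent_defective(labels):
    """Share of items labelled defective, as a percentage."""
    if not labels:
        return 0.0
    return 100.0 * sum(labels.values()) / len(labels)


# Hypothetical counts for illustration only, not taken from the dataset.
example = {"A.java": 0, "B.java": 2, "C.java": 0, "D.java": 1}
labels = label_defective(example)
print(labels, percent_defective(labels))
```

Run over the real per-file or per-package counts, `percent_defective` should give figures comparable to the "% with defects" column, provided the assumption about the threshold holds.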

8. Attribute info

            <!DOCTYPE defects [
              <!ELEMENT defects (plug-in)+>
              <!ATTLIST defects project CDATA #REQUIRED>
              <!ATTLIST defects release CDATA #REQUIRED>
              <!ATTLIST defects dataversion CDATA #REQUIRED>
              <!ELEMENT plug-in (compilationunit, counts, package)>
              <!ATTLIST plug-in name CDATA #REQUIRED>
              <!ELEMENT package (compilationunit, counts, package)>
              <!ATTLIST package name CDATA #REQUIRED>
              <!ELEMENT counts (count , count)>
              <!ELEMENT count EMPTY>
              <!ATTLIST count id CDATA #REQUIRED>
              <!ATTLIST count value CDATA #REQUIRED>
              <!ATTLIST count avg CDATA #IMPLIED>
              <!ATTLIST count compilationunits CDATA #IMPLIED>
              <!ATTLIST count max CDATA #IMPLIED>
              <!ELEMENT compilationunit (counts , fix*)>
              <!ATTLIST compilationunit dir CDATA #REQUIRED>
              <!ATTLIST compilationunit base CDATA #REQUIRED>
              <!ATTLIST compilationunit filename CDATA #REQUIRED>
              <!ELEMENT fix (message)>
              <!ATTLIST fix kind CDATA #REQUIRED>
              <!ATTLIST fix bug_id CDATA #REQUIRED>
              <!ATTLIST fix revision_id CDATA #REQUIRED>
              <!ATTLIST fix author CDATA #REQUIRED>
              <!ELEMENT message (#PCDATA)>
            ]>

9. Paper abstract (if appropriate)

Accurate detection of defects prior to product release helps software engineers focus verification activities on defect prone modules, thus improving the effectiveness of software development. A common scenario is to use the defects from prior releases to build the prediction model for the upcoming release, typically through a supervised learning method. As software development is a dynamic process, fault characteristics in subsequent releases may vary. Therefore, supplementing the defect information from prior releases with limited information about the defects from the current release detected early, seems to offer intuitive and practical benefits. We propose active learning as a way to automate the development of models which improve the performance of defect prediction between successive releases. Our results show that the integration of active learning with uncertainty sampling consistently outperforms the corresponding supervised learning approach. We further improve the prediction performance with feature compression techniques, where feature selection or dimensionality reduction is applied to defect data prior to active learning. We observe that dimensionality reduction techniques, particularly multidimensional scaling with random forest similarity, work better than feature selection due to their ability to identify and combine essential information in data set features. We present the improvements offered by this methodology through the prediction of defective modules in the three successive versions of Eclipse.
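For readers unfamiliar with the uncertainty sampling mentioned in the abstract, the sketch below shows one common generic formulation for a binary classifier: query the items whose predicted defect probability is closest to 0.5. The thread does not give the paper's exact query strategy, so this is an assumption, and the function name and probabilities are hypothetical.

```python
def most_uncertain(probabilities, k=1):
    """Return indices of the k items whose predicted defect probability
    is closest to 0.5, i.e. where a binary classifier is least sure."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


# Hypothetical model outputs for five unlabelled modules.
probs = [0.95, 0.48, 0.10, 0.62, 0.51]
picked = most_uncertain(probs, k=2)
print(picked)
```

In an active learning loop, the picked modules would be labelled (for example from early defect reports on the current release), added to the training set, and the model retrained before the next query round.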

10. Is this dataset part of a larger series or collection? Several versions of Eclipse data sets have been in use to study defect prediction [27], [28]. In our study, we use the Eclipse data sets introduced by Zimmermann et al. [15], which are publicly available.

reesjones commented 9 years ago

Added as huihuaprediction.

opensciences / opensciences.github.io

Defect Prediction between Software Versions with Active Learning and Dimensionality Reduction #147