The figures below show the numerical improvements (beyond the statistical-test results) from using F3T as a buggy commit identification and prediction system instead of the standard Commit.Guru system. In these figures, the taller the vertical bar, the better F3T performs in comparison to another learning method. Let X be the F3T score and Y the score from another data mining method; then the height of each bar is the median X-Y seen across all tests in a project:
K: Keyword, S: SMOTE, SVM: Support Vector Machines, LR: Logistic Regression, RF: Random Forest
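The per-project bar height described above (median of the per-test differences X-Y) can be sketched as follows. The score values and list names here are illustrative assumptions, not data from the study:

```python
from statistics import median

# Hypothetical per-test scores for one project (illustrative values only).
f3t_scores = [0.61, 0.58, 0.70, 0.65]    # X: F3T G-scores, one per test
other_scores = [0.50, 0.55, 0.48, 0.60]  # Y: e.g. Keyword+SMOTE+RF G-scores

# Bar height for this project = median of the per-test differences X - Y.
deltas = [x - y for x, y in zip(f3t_scores, other_scores)]
bar_height = median(deltas)
print(round(bar_height, 2))
```

A positive bar height means F3T outperformed the other method on the median test in that project; a negative height means it lost there.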
F3T performs as well as, or better than, the other learners in 21/27 cases for both G-score and Popt(20).
RF is widely adopted in defect prediction tasks (ranked first by Ghotra et al.), but surprisingly, F3T usually performs much better than Keyword+SMOTE+RF (except in 2 cases for Popt(20)).
When F3T performs comparatively better, it can do so by a large margin: up to 25% and 27% absolute improvement, which corresponds to up to 103% and 85% relative improvement for G-score and Popt(20), respectively.
When F3T performs comparatively worse, its losses are small (see the negative vertical bars on the left-hand side, where F3T often loses by just 3% and 7% in absolute terms, or only 11% and 7% in relative terms).