stg-tud / MUBench


Number of dataset configuration #428

Closed xgdsmileboy closed 5 years ago

xgdsmileboy commented 5 years ago

Hello, this is an excellent project and I am very interested in it. However, after reading the online README and your TSE paper, I am confused about the number of misuses in the different experiments. As described in the README:

Experiment R (recall)
Dataset TSE17-ExRecall contains 53 misuses (all from 29 versions of 13 projects, no hand-crafted examples)

there should be 53 misuses. However, only 39 misuses are actually listed there (from line 572 to line 611). As presented in your paper, "Experiment R" also considers the true positives detected by existing detectors, which I believe are those under TSE17-ExPrecision-TruePositives. But the total number of misuses across these two datasets (i.e., TSE17-ExRecall and TSE17-ExPrecision-TruePositives) would then be 58.
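To make the counting concrete, here is a minimal sketch of how one could tally the datasets, assuming they are stored as a YAML mapping from dataset name to misuse IDs (the file name `data/datasets.yml` and the exact keys are assumptions on my side):

```python
# Sketch: tally the dataset configurations, assuming they are stored as a
# YAML mapping from dataset name to a list of misuse IDs. The file name
# data/datasets.yml and the exact keys are assumptions for illustration.
import yaml

with open("data/datasets.yml") as f:
    datasets = yaml.safe_load(f)

recall = set(datasets["TSE17-ExRecall"])
true_positives = set(datasets["TSE17-ExPrecision-TruePositives"])

print("TSE17-ExRecall:", len(recall))                           # 39 entries listed
print("TSE17-ExPrecision-TruePositives:", len(true_positives))
print("Union of both:", len(recall | true_positives))           # 58 by this count
```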

Thus, I am confused about the number of misuses. Could you please tell me whether I understand this correctly? Thanks.

salsolatragus commented 5 years ago

Dear Jiajun Jiang,

thank you very much for the feedback. Your understanding of the dataset is correct, and there is indeed a discrepancy between the number of misuses reported in the paper and the number of misuses in the dataset. When we conducted the experiment for the TSE, we did not include 5 of the misuses (hence the numbers: 58 - 5 = 53).

This is because they are additional instances of the same misuse (same mistake in using the same API in the same method). Since all detectors in the TSE experiments report at most one instance of a particular misuse per method, we included only one instance each in the dataset (itext.5091.dmmc-16 and lucene.1918.tikanga-1).
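Conceptually, the de-duplication looks roughly like the following sketch (the record structure and field names are made up for illustration, not the actual MUBench code):

```python
# Sketch: collapse multiple instances of the same misuse (same API misused in
# the same method of the same project version) into a single dataset entry.
# The record structure and field names are illustrative only.
def deduplicate(instances):
    kept = {}
    for inst in instances:
        key = (inst["project"], inst["version"], inst["method"], inst["api"])
        # The TSE'17 detectors report at most one finding per misuse and method,
        # so only the first instance per key is retained.
        kept.setdefault(key, inst)
    return list(kept.values())
```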

In our MSR'19 paper we present MUDetect, a detector that may report multiple instances of the same misuse in a method. We found that we cannot fairly compare all detectors if we do not distinguish multiple instances of the same misuse within a method. Therefore, we added the additional instances after the fact. In that sense, we did not change the datasets, but corrected our interpretation of the detector results.

For the MSR'19 experiments, we counted all instances of these misuses that appeared in the top 20 findings as true positives (Experiment P). For the detectors that report at most one instance of a misuse per method, we conservatively counted a hit for all instances of the misuse (Experiment R), while for MUDetect, we only counted the exact hits.
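In rough pseudocode, the two counting rules differ like this (again only a sketch with illustrative data structures, not the actual evaluation scripts):

```python
# Sketch: hit counting for Experiment R. `dataset` maps a misuse ID to the list
# of its instances; `findings` maps a misuse ID to the set of instances a
# detector actually reported. All names are illustrative only.
def count_hits(dataset, findings, reports_multiple_instances):
    hits = 0
    for misuse, instances in dataset.items():
        reported = findings.get(misuse, set())
        if not reported:
            continue
        if reports_multiple_instances:
            # MUDetect-style counting: only instances that were actually hit.
            hits += len(reported & set(instances))
        else:
            # Detectors that report at most one instance per method:
            # conservatively count a hit for every instance of the misuse.
            hits += len(instances)
    return hits
```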

I hope this clears up the confusion. Please feel free to ask further questions! Best, Sven

xgdsmileboy commented 5 years ago

Hi Sven, thanks for your detailed reply, it helps a lot. Sincerely, Jiajun