Open HuyTu7 opened 6 years ago
Using Zhe's tool (FastRead), I labeled a proportion of the data as a sanity check and compared my labels with the labels assigned during the hackathon for the same commit hashes.
I kept labeling until the number of buggy commits I had found reached 90~95%+ of the buggy-commit count estimated by FastRead. It is therefore reasonably safe to treat the remaining undetermined commits as non-buggy (accepting that roughly 5-10% of the buggy commits may still be among them). When I compare on that basis against all of the data labeled during the hackathon, the result is much better.
From these numbers, my conclusions are:
1/ A large proportion of the labeled data is duplicated, and the inconsistency among the duplicates is low, so the data can be considered highly valid.
2/ There is some inconsistency between the hackathon labels and the labels produced with FastRead, which is concerning. It may be because many commits fix things that come up while writing papers, fix documentation, etc.; these were very likely counted as bug fixes during the hackathon but not during the sanity check.
Next steps: 1/ Look for rules or NLP methods to summarize and automate the labeling process.
well done, you've confused your supervisor. you'll need to explain this to me next time we speak
predicted | Truth=0 | Truth=1 |
---|---|---|
predicted=n | | |
predicted=y | | |
abing:
predicted | Truth=0 | Truth=1 |
---|---|---|
predicted=n | 377 | 68 |
predicted=y | 190 | 491 |
undetermined | 3712 | 79 |
lammps:
predicted | Truth=0 | Truth=1 |
---|---|---|
predicted=n | 283 | 2 |
predicted=y | 499 | 74 |
undetermined | 6457 | 9 |
mdanalysis:
predicted | Truth=0 | Truth=1 |
---|---|---|
predicted=n | 480 | 19 |
predicted=y | 453 | 322 |
undetermined | 2013 | 16 |
libmesh:
predicted | Truth=0 | Truth=1 |
---|---|---|
predicted=n | 715 | 107 |
predicted=y | 701 | 698 |
undetermined | 6431 | 27 |
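To make the four tables above easier to compare, here is a minimal Python sketch that turns each one into precision, recall, and accuracy. It rests on assumptions the thread does not confirm: that `predicted=y` and `Truth=1` both mean "buggy", and that the `undetermined` row can be folded into `predicted=n`, as proposed in the opening comment. The names `Counts`, `metrics`, and `TABLES` are made up for illustration; the counts are copied from the tables.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Cell counts of one project's table (row, then Truth column)."""
    n0: int  # predicted=n,  Truth=0
    n1: int  # predicted=n,  Truth=1
    y0: int  # predicted=y,  Truth=0
    y1: int  # predicted=y,  Truth=1
    u0: int  # undetermined, Truth=0
    u1: int  # undetermined, Truth=1


def metrics(c: Counts, fold_undetermined: bool = True) -> dict:
    """Precision/recall/accuracy, treating 'y' and Truth=1 as the positive class.

    Assumption: when fold_undetermined is True, undetermined commits are
    counted as predicted=n; otherwise they are left out entirely.
    """
    tn = c.n0 + (c.u0 if fold_undetermined else 0)
    fn = c.n1 + (c.u1 if fold_undetermined else 0)
    fp, tp = c.y0, c.y1
    total = tn + fn + fp + tp
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Counts copied from the tables in this thread.
TABLES = {
    "abing": Counts(377, 68, 190, 491, 3712, 79),
    "lammps": Counts(283, 2, 499, 74, 6457, 9),
    "mdanalysis": Counts(480, 19, 453, 322, 2013, 16),
    "libmesh": Counts(715, 107, 701, 698, 6431, 27),
}

if __name__ == "__main__":
    for name, counts in TABLES.items():
        m = metrics(counts)
        print(f"{name}: precision={m['precision']:.3f} "
              f"recall={m['recall']:.3f} accuracy={m['accuracy']:.3f}")
```

Calling `metrics(counts, fold_undetermined=False)` shows how much the result depends on the "undetermined means non-buggy" assumption.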
Estimates of reading time for 10,000 projects:
assumes 90 minutes to skim the issue reports
assumes 1.1 people reading each project (so when there are disputes, we can ask someone else)
well to be honest I'd double these costs
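The arithmetic behind this estimate can be sketched in a few lines of Python. One assumption is mine: the thread does not say whether the 90 minutes is per project, and I read it that way here. The function name `reading_hours` and the `safety_factor` parameter (2.0 for the "double these costs" remark) are illustrative only.

```python
def reading_hours(projects: int,
                  minutes_per_project: float,
                  readers_per_project: float,
                  safety_factor: float = 1.0) -> float:
    """Total person-hours of reading.

    Assumption: minutes_per_project is the time one reader needs to skim
    one project's issue reports.
    """
    total_minutes = projects * minutes_per_project * readers_per_project
    return total_minutes * safety_factor / 60


if __name__ == "__main__":
    base = reading_hours(10_000, 90, 1.1)
    doubled = reading_hours(10_000, 90, 1.1, safety_factor=2.0)
    print(f"base estimate:    {base:,.0f} person-hours")
    print(f"doubled estimate: {doubled:,.0f} person-hours")
```

Under these assumptions the base figure comes to roughly 16,500 person-hours, and roughly 33,000 once doubled.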
@HuyTu7 @azhe825 : there is something odd about the "A" numbers on the above tables. @azhe825 please get with @HuyTu7 and sort that out.
thanks!
@Huy Tu hqtu@ncsu.edu if it is convenient for you, please meet me in the lab tomorrow.
Best Regards,
Zhe, Ph.D. scholar @ CS, NcState http://azhe825.github.io
On Mon, Oct 1, 2018 at 1:52 PM Tim Menzies notifications@github.com wrote:
@HuyTu7 @azhe825 : there is something odd about the "A" numbers on the above tables. @azhe825 please get with @HuyTu7 and sort that out.
thanks!
Will do, @timm and @azhe825 .
There was a large amount of duplication in the hackathon labeling, so I identified the duplicated portion and measured how consistent the hackathon attendees' labels were with each other on it.
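A minimal sketch of that duplication check in Python is below. It assumes the hackathon labels are available as `(commit_hash, labeler, label)` records; the actual data layout is not shown in this thread. The function name `duplicate_consistency` and the toy records are hypothetical, not real hackathon data.

```python
from collections import defaultdict


def duplicate_consistency(labels):
    """Summarize agreement on commits that were labeled more than once.

    labels: iterable of (commit_hash, labeler, label) tuples (assumed layout).
    A duplicated commit counts as consistent when all its labels agree.
    """
    by_hash = defaultdict(list)
    for commit_hash, _labeler, label in labels:
        by_hash[commit_hash].append(label)

    duplicated = {h: ls for h, ls in by_hash.items() if len(ls) > 1}
    consistent = sum(1 for ls in duplicated.values() if len(set(ls)) == 1)
    return {
        "total_commits": len(by_hash),
        "duplicated_commits": len(duplicated),
        "consistent_duplicates": consistent,
        "consistency_rate": consistent / len(duplicated) if duplicated else None,
    }


if __name__ == "__main__":
    # Toy records for illustration only.
    toy = [
        ("a1", "alice", 1), ("a1", "bob", 1),   # duplicated, consistent
        ("b2", "alice", 0), ("b2", "carol", 1), # duplicated, inconsistent
        ("c3", "bob", 0),                       # labeled once
    ]
    print(duplicate_consistency(toy))
```

Requiring all labels on a commit to agree is the strictest definition of consistency; a majority vote would be a softer alternative.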