se4sci / auto-labeler

MIT License

Sanity check on hackathon labels data #1

Open HuyTu7 opened 6 years ago

HuyTu7 commented 6 years ago

There was a large amount of duplicate labeling during the hackathon, so I looked for the duplicated portion and measured how consistent the attendees' labels for the same commits are with each other.
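A minimal sketch of this kind of duplicate check, assuming the hackathon labels are available as `(commit_hash, labeler, label)` tuples. The function name, tuple layout, and sample data are hypothetical, not the format actually used in this repo:

```python
from collections import defaultdict


def duplicate_agreement(labels):
    """Measure label consistency on commits labeled more than once.

    labels: iterable of (commit_hash, labeler, label) tuples (assumed layout).
    Returns (n_duplicated, n_consistent, agreement_rate).
    """
    by_hash = defaultdict(list)
    for commit_hash, _labeler, label in labels:
        by_hash[commit_hash].append(label)

    # Keep only commits that received two or more labels.
    duplicated = {h: ls for h, ls in by_hash.items() if len(ls) > 1}
    # A duplicated commit is consistent when every label on it is identical.
    consistent = sum(1 for ls in duplicated.values() if len(set(ls)) == 1)
    rate = consistent / len(duplicated) if duplicated else float("nan")
    return len(duplicated), consistent, rate


# Made-up sample: "a1" is labeled consistently, "b2" is disputed, "c3" is not duplicated.
sample = [
    ("a1", "alice", 1), ("a1", "bob", 1),
    ("b2", "alice", 0), ("b2", "carol", 1),
    ("c3", "bob", 0),
]
n_dup, n_ok, rate = duplicate_agreement(sample)
print(f"{n_dup} duplicated commits, {n_ok} consistent, agreement {rate:.2f}")
```

Grouping by hash first keeps the check independent of how many attendees labeled each commit.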

HuyTu7 commented 6 years ago

Using Zhe's tool (FastRead), I labeled a proportional sample of the data as a sanity check and compared my labels with the hackathon labels for the same commit hashes.

However, I kept labeling until the number of buggy commits found reached 90~95%+ of the number of buggy commits estimated by FastRead. It is therefore reasonably safe to assume that the remaining undetermined commits are non-buggy (with roughly 5-10% of the buggy commits possibly still among them). When I compare on that basis against all of the data labeled during the hackathon, the result is much better.
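The stopping rule and the "treat the rest as non-buggy" assumption can be sketched as below. This is only an illustration under stated assumptions: the function names, the `None` marker for undetermined commits, and the 0/1 label encoding are hypothetical and are not FastRead's API.

```python
def reached_target(found_buggy, estimated_buggy, target=0.95):
    """True once the buggy commits found cover `target` of the estimated total."""
    if estimated_buggy <= 0:
        return False  # no usable estimate yet, so keep labeling
    return found_buggy / estimated_buggy >= target


def fill_undetermined(labels, default=0):
    """Replace undetermined labels (None) with `default` (0 = non-buggy, assumed encoding)."""
    return {h: (default if lab is None else lab) for h, lab in labels.items()}


# Hypothetical usage: stop at 96 of an estimated 100 buggy commits,
# then treat everything still unlabeled as non-buggy.
if reached_target(found_buggy=96, estimated_buggy=100):
    final = fill_undetermined({"a1": 1, "b2": None, "c3": 0})
    print(final)
```

Lowering `target` to 0.90 corresponds to the looser end of the 90~95% range mentioned above.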

HuyTu7 commented 6 years ago

Conclusions from these numbers:
1/ A large proportion of the labeled data is duplicated and the inconsistency among the duplicates is low, so the data can be considered highly valid.
2/ There is some concerning inconsistency between the hackathon labels and the labels produced with FastRead. It may be because many commits are fixes made while writing papers, fixes to documentation, etc., which were very likely counted as bug fixes during the hackathon but not during the sanity check.

Next steps: 1/ Look for rules or NLP methods to summarize and automate the labeling process.

timm commented 6 years ago

well done, you've confused your supervisor. you'll need to explain this to me next time we speak

timm commented 6 years ago

predicted     Truth=0  Truth=1
predicted=n
predicted=y

timm commented 6 years ago

abing:

predicted     Truth=0  Truth=1
predicted=n       377       68
predicted=y       190      491
undetermined     3712       79

lammps:

predicted     Truth=0  Truth=1
predicted=n       283        2
predicted=y       499       74
undetermined     6457        9

mdanalysis:

predicted     Truth=0  Truth=1
predicted=n       480       19
predicted=y       453      322
undetermined     2013       16

libmesh:

predicted     Truth=0  Truth=1
predicted=n       715      107
predicted=y       701      698
undetermined     6431       27
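
For anyone who wants summary numbers from these tables, here is a rough sketch that computes precision and recall per project. It assumes that rows are predictions and columns are the reference labels, and that each row lists the Truth=0 count first. The thread does not say which labeling source counts as "truth", so treat that reading as an assumption. The `undetermined_as_negative` switch is a hypothetical helper that folds the undetermined row into predicted=n, matching the "rest are non-buggy" assumption discussed earlier.

```python
# Counts copied from the tables above: (Truth=0, Truth=1) per row.
TABLES = {
    "abing":      {"n": (377, 68),  "y": (190, 491), "undetermined": (3712, 79)},
    "lammps":     {"n": (283, 2),   "y": (499, 74),  "undetermined": (6457, 9)},
    "mdanalysis": {"n": (480, 19),  "y": (453, 322), "undetermined": (2013, 16)},
    "libmesh":    {"n": (715, 107), "y": (701, 698), "undetermined": (6431, 27)},
}


def precision_recall(table, undetermined_as_negative=False):
    """Return (precision, recall) for the buggy class (Truth=1)."""
    tn, fn = table["n"]   # predicted=n row: true negatives, false negatives
    fp, tp = table["y"]   # predicted=y row: false positives, true positives
    if undetermined_as_negative:
        # Assumption: count undetermined commits as predicted non-buggy.
        u0, u1 = table["undetermined"]
        tn, fn = tn + u0, fn + u1
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


for name, table in TABLES.items():
    p, r = precision_recall(table)
    p_u, r_u = precision_recall(table, undetermined_as_negative=True)
    print(f"{name:11s} precision={p:.2f} recall={r:.2f} "
          f"(undetermined as n: recall={r_u:.2f})")
```

Precision is unchanged by the switch, since the undetermined row never adds predicted=y commits; only recall moves.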
timm commented 6 years ago

estimates for reading time for 10,000 projects

assumes 90 minutes to skim the issue reports

assumes 1.1 people reading each project (so when there are disputes, we can ask someone else)

well to be honest I'd double these costs
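
A back-of-the-envelope version of this estimate, assuming the 90 minutes is per project (the comment does not say so explicitly). The constant and function names are hypothetical:

```python
PROJECTS = 10_000
MINUTES_PER_PROJECT = 90      # assumption: 90 minutes of skimming per project
READERS_PER_PROJECT = 1.1     # extra 10% covers disputed projects


def reading_hours(projects, minutes_per_project, readers_per_project, safety_factor=1.0):
    """Total person-hours of reading, optionally scaled by a safety factor."""
    total_minutes = projects * minutes_per_project * readers_per_project * safety_factor
    return total_minutes / 60


base = reading_hours(PROJECTS, MINUTES_PER_PROJECT, READERS_PER_PROJECT)
doubled = reading_hours(PROJECTS, MINUTES_PER_PROJECT, READERS_PER_PROJECT, safety_factor=2.0)
print(f"base estimate: {base:,.0f} person-hours; doubled: {doubled:,.0f} person-hours")
```

Under these assumptions the base figure is about 16,500 person-hours, and doubling the costs gives about 33,000.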

timm commented 6 years ago

@HuyTu7 @azhe825 : there is something odd about the "A" numbers on the above tables. @azhe825 please get with @HuyTu7 and sort that out.

thanks!

azhe825 commented 6 years ago

@Huy Tu hqtu@ncsu.edu if it is convenient for you, please meet me in the lab tomorrow.

Best Regards,

Zhe, Ph.D. scholar @ CS, NcState http://azhe825.github.io

HuyTu7 commented 6 years ago

Will do, @timm and @azhe825 .