Sanity check on hackathon labels data

HuyTu7 commented 6 years ago

There was a lot of duplication in the labeling during the hackathon, so I measured how much of the data is duplicated and how consistent the duplicated labels are across the hackathon attendees.

'abinit', 51.46% duplication with 5.826% different in labeling (5329 labeled commits originally)
'lammps', 72.483% duplication with 1.12% different in labeling (7324 labeled commits originally)
'mdanalysis', 40.131% duplication with 10.253% different in labeling (3303 labeled commits originally)
'libmesh', 1.16% duplication with 2.94% different in labeling (8679 labeled commits originally)
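
A minimal sketch of how figures like these could be computed. The exact definitions used for the numbers above are not stated in this thread, so the ones here are assumptions: "duplication" is taken as the share of labeling entries whose commit hash was labeled more than once, and "different in labeling" as the share of duplicated hashes that received conflicting labels. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def duplication_stats(labels):
    """Summarize duplicated labels.

    labels: iterable of (commit_hash, label) pairs, one per labeling event.
    The two percentages follow the assumed definitions described above.
    """
    by_hash = defaultdict(list)
    for commit_hash, label in labels:
        by_hash[commit_hash].append(label)

    total_entries = sum(len(v) for v in by_hash.values())
    # Hashes that were labeled more than once.
    duplicated = {h: v for h, v in by_hash.items() if len(v) > 1}
    duplicated_entries = sum(len(v) for v in duplicated.values())
    # Duplicated hashes whose labels do not all agree.
    conflicting = sum(1 for v in duplicated.values() if len(set(v)) > 1)

    return {
        "duplication_pct": 100.0 * duplicated_entries / total_entries if total_entries else 0.0,
        "conflict_pct": 100.0 * conflicting / len(duplicated) if duplicated else 0.0,
    }


# Made-up example: two hashes are labeled twice, one of them inconsistently.
sample = [("a1", 1), ("a1", 1), ("b2", 0), ("b2", 1), ("c3", 0), ("d4", 1)]
stats = duplication_stats(sample)
print(stats)
```

Counting conflicts per hash rather than per entry is a design choice; counting per entry would give a different percentage for the same data.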

HuyTu7 commented 6 years ago

Using Zhe's tool (FastRead), I labeled a proportion of the data as a sanity check and compared my labels with the hackathon labels for the same commit hashes.

'abinit', 1121 entries recorded, 23.01% different from the labeled data
'lammps', 858 entries recorded, 58.39% different from the labeled data
'mdanalysis', 1274 entries recorded, 37.05% different from the labeled data
'libmesh', 1001 entries recorded, 39.46% different from the labeled data

However, I only labeled until the number of buggy commits found reached 90~95%+ of the number of buggy commits estimated by FastRead. It is therefore reasonable to assume that the remaining undetermined commits are non-buggy (leaving about 5-10% of the estimated buggy commits undetected). Comparing against all of the data labeled during the hackathon under that assumption, the result is much better.

'abinit', 6.54% different from the labeled data
'lammps', 6.85% different from the labeled data
'mdanalysis', 14.74% different from the labeled data
'libmesh', 8.01% different from the labeled data

HuyTu7 commented 6 years ago

Looking at these numbers, conclusions: 1/ Where a large proportion of the labeled data is duplicated and the inconsistency among the duplicates is low, the data can be considered highly valid.
2/ There is some concerning inconsistency between the hackathon labels and the labels produced with FastRead. This may be because many commits fix things such as paper writing or documentation, which were very likely counted as bug fixes during the hackathon but not during the sanity check.

Next steps: 1/ Look for rules or NLP methods to summarize and automate the labeling process.

timm commented 6 years ago

well done, you've confused your supervisor. you'll need to explain this to me next time we speak

timm commented 6 years ago

predicted	Truth=0	Truth=1
predicted=n
predicted=y

timm commented 6 years ago

abinit:

predicted	Truth=0	Truth=1
predicted=n	377	68
predicted=y	190	491
undetermined	3712	79

lammps:

predicted	Truth=0	Truth=1
predicted=n	283	2
predicted=y	499	74
undetermined	6457	9

mdanalysis:

predicted	Truth=0	Truth=1
predicted=n	480	19
predicted=y	453	322
undetermined	2013	16

libmesh:

predicted	Truth=0	Truth=1
predicted=n	715	107
predicted=y	701	698
undetermined	6431	27
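
As a rough cross-check, the four tables above can be summarized with a short script. This is only a sketch under stated assumptions: the columns (Truth=0, Truth=1) are treated as the reference labels, the rows as the other labeling, and the "undetermined" row is excluded from the disagreement rate. The thread does not say which labeling is which, so the precision and recall values are illustrative only. The dictionary and function names are hypothetical, and "abinit" refers to the first table.

```python
# Counts copied from the tables above: (Truth=0, Truth=1) per row.
TABLES = {
    "abinit": {"n": (377, 68), "y": (190, 491), "undetermined": (3712, 79)},
    "lammps": {"n": (283, 2), "y": (499, 74), "undetermined": (6457, 9)},
    "mdanalysis": {"n": (480, 19), "y": (453, 322), "undetermined": (2013, 16)},
    "libmesh": {"n": (715, 107), "y": (701, 698), "undetermined": (6431, 27)},
}


def summarize(table):
    """Summarize one table, assuming the Truth columns are the reference labels."""
    tn, fn = table["n"]  # predicted=n row
    fp, tp = table["y"]  # predicted=y row
    und0, und1 = table["undetermined"]
    determined = tn + fn + fp + tp
    return {
        "determined": determined,
        "total": determined + und0 + und1,
        # Share of determined commits where the two labelings differ.
        "disagreement_pct": 100.0 * (fn + fp) / determined,
        "precision": tp / (tp + fp),
        # Recall over determined rows only; the undetermined row is ignored.
        "recall_determined": tp / (tp + fn),
    }


for project, table in TABLES.items():
    s = summarize(table)
    print(
        f"{project}: determined={s['determined']} total={s['total']} "
        f"disagreement={s['disagreement_pct']:.2f}% "
        f"precision={s['precision']:.3f} recall={s['recall_determined']:.3f}"
    )
```

For lammps and mdanalysis, the determined rows sum to 858 and 1274 entries and the disagreement works out to 58.39% and 37.05%, which matches the figures reported earlier in this thread. The other two tables do not line up with the earlier entry counts in the same way.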

timm commented 6 years ago

estimates for reading time for 10,000 projects

assumes 90 minutes to skim the issue reports

then reduced to 30 minutes by Zhe's tool

assuming 1.1 people reading each project (so when there are disputes, we can ask someone else)

hours = 1.1 * 30/60 * 10000 = 5500 hours
on mechanical turk, at $10/hour: 55,000 dollars

well to be honest I'd double these costs
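
The arithmetic above can be re-run with different assumptions using a small script. This is a sketch only: the function and parameter names are hypothetical, the 90-minute and 30-minute figures are read as per-project times, and the "doubled" line simply multiplies the result by two.

```python
def reading_cost(projects, minutes_per_project, readers_per_project, dollars_per_hour):
    """Return (hours, dollars) for reading every project once per reader."""
    hours = readers_per_project * (minutes_per_project / 60) * projects
    return hours, hours * dollars_per_hour


# Figures from the estimate above: 10,000 projects, 30 minutes each with
# the tool, 1.1 readers per project, $10/hour.
hours, dollars = reading_cost(10_000, 30, 1.1, 10)
print(f"with tool: {round(hours)} hours, ${round(dollars):,}")

# Same estimate with costs doubled, as suggested above.
print(f"doubled:   {round(2 * hours)} hours, ${round(2 * dollars):,}")

# For comparison, the unassisted 90-minute skim under the same assumptions.
manual_hours, manual_dollars = reading_cost(10_000, 90, 1.1, 10)
print(f"no tool:   {round(manual_hours)} hours, ${round(manual_dollars):,}")
```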

timm commented 6 years ago

@HuyTu7 @azhe825 : there is something odd about the "A" numbers on the above tables. @azhe825 please get with @HuyTu7 and sort that out.

thanks!

azhe825 commented 6 years ago

@HuyTu7, if it is convenient for you, please meet me in the lab tomorrow.


HuyTu7 commented 6 years ago

Will do, @timm and @azhe825 .

se4sci / auto-labeler

Sanity check on hackathon labels data #1