sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0

timestamps of committers are incorrectly assigned to authors in social smell notebook #106

Closed carlosparadis closed 3 years ago

carlosparadis commented 3 years ago

@tuejari found this bug on the social-smell branch notebook. @harismumtaz18 @CorneJB this may affect the current tables you generated, you may want to double check.

CorneJB commented 3 years ago

Thanks for the heads up! These would be the timestamps in the project_git table? If it is about the timestamps not matching the actual analyzed window in the final table of the notebook, I think I have noticed it. Did not really register that it was actually a bug. Great catch!

carlosparadis commented 3 years ago

@CorneJB wow that was a quick follow-up! :) I was about to send the commit with the fix haha. Take a look at the diff above, the exact offender is here:

[Image: Screen Shot 2021-07-15 at 5 18 33 AM]

CorneJB commented 3 years ago

@carlosparadis Do you think it affects the detection of smells when using bipartite projection? I just handed everything in a few hours before this message :sweat_smile:, so what is done is done. But otherwise, I will have to append some stuff. It might have introduced some noise on the front and tail end of the window selection, but if it didn't affect the bipartite smells I think I might be okay.

carlosparadis commented 3 years ago

@CorneJB Since only the timestamp is incorrect, and not the author name/email itself, the only place in the pipeline I imagine this affects is the division into time slices, which ends up using the committer timestamp instead of the author timestamp. So you can actually check that yourself: subtract the author timestamp from the committer timestamp on your dataset, and check if the difference is less than the size of the time window you chose for the slices (the default was 3 months in the Notebook). From memory, I don't think anything else would get affected (since all metrics are computed after a slice is created).

In layman's terms: if the time between the author's submission (e.g. pull request or patch) and the commit is less than 3 months (or whatever time window you chose), then the slices should be the same, and consequently, I'd expect all the results to be the same.
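
For anyone who wants to run that check outside the notebook, here is a minimal sketch in plain Python. It is not Kaiaulu code: the field names (`hash`, `author_ts`, `committer_ts`), the sample commits, the slice start date, and the approximation of "3 months" as 90 days are all assumptions made for illustration. Besides the lag test described above, it also compares the slice index of the two timestamps directly, since a commit sitting close to a slice boundary could in principle change slice even when the lag is short.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the notebook's default "3 months" window approximated as 90 days.
WINDOW = timedelta(days=90)


def slice_index(ts, start, window=WINDOW):
    """Index of the fixed-size time slice that `ts` falls into, counting from `start`."""
    return (ts - start) // window


def compare_timestamps(commits, start, window=WINDOW):
    """For each commit, report the author-to-committer lag and whether the slice changes.

    `commits` is a list of dicts with hypothetical keys `hash`, `author_ts`
    and `committer_ts` (timezone-aware datetimes).
    """
    rows = []
    for c in commits:
        lag = c["committer_ts"] - c["author_ts"]
        rows.append({
            "hash": c["hash"],
            "lag": lag,
            # The check suggested in the comment above: is the lag shorter than the window?
            "lag_exceeds_window": abs(lag) >= window,
            # Stricter check: do both timestamps land in the same slice?
            "slice_changed": slice_index(c["author_ts"], start, window)
            != slice_index(c["committer_ts"], start, window),
        })
    return rows


utc = timezone.utc
start = datetime(2021, 1, 1, tzinfo=utc)  # assumed start of the first slice
commits = [  # made-up sample data
    {"hash": "a1", "author_ts": datetime(2021, 1, 10, tzinfo=utc),
     "committer_ts": datetime(2021, 1, 12, tzinfo=utc)},
    {"hash": "b2", "author_ts": datetime(2021, 1, 5, tzinfo=utc),
     "committer_ts": datetime(2021, 6, 1, tzinfo=utc)},
    {"hash": "c3", "author_ts": datetime(2021, 3, 30, tzinfo=utc),
     "committer_ts": datetime(2021, 4, 2, tzinfo=utc)},
]

rows = compare_timestamps(commits, start)
for row in rows:
    print(row["hash"], row["lag"], row["lag_exceeds_window"], row["slice_changed"])
```

In this made-up sample, `a1` is unaffected, `b2` fails the lag test, and `c3` has a lag of only three days but still crosses a slice boundary, so counting the commits where `slice_changed` is true is probably the more direct way to see whether your results could differ.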

The interesting thing is that, despite being an oversight on my part on that line of code, the theory behind it is not necessarily "a bug" or "noise" I think: yes, you are assessing the network of author file changes from git log, and comparing it with when the authors possibly shared e-mail threads (e.g. radio silence metric) on the mailing list, but you opted to assume discussion would happen at the time of the commit instead of the time of the patch/pull request. If you consider that pull requests and patches can sit around for a while until a reviewer comes by and a discussion occurs, maybe this mistake can even be more accurate as an assumption. Conversely, you could argue discussion would happen closer to when the author would be trying to make the patch...or maybe both! :^) Hope this gives you some food for thought! (pinging @harismumtaz18 @tuejari since this may be of interest to them). Maybe @maelstromdat or @rnkazman have another interpretation of this too.

But by all means, don't take my word for it, do give it some thought based on the code and the implications! This is one reason why I decided to follow the "Here's a notebook so you see the analysis steps and intermediate data" approach, rather than "here is a command-line you run and whatever output you get take my word for it!" 😅