sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0

Adding developer turnover metrics to smells notebook #96

Open CorneJB opened 3 years ago

CorneJB commented 3 years ago

Hi!

I made this today for analyzing the turnover between analysis windows based on the quality metrics in Magnoni (2016)'s dissertation. I did not really know where to put it in terms of branches, issues, and pulls, so I'm dropping it here to get some feedback and create a branch/pull if it is needed/wanted. The for-loop below uses project_git_slicestack, which is a list containing the slices that are made during the social smells part of the social smells showcase vignette.

## Initiate turnover variable
code_dev_turnover <- list()

## Loop through slices and calculate turnover based on Magnoni (2016)
for(i in seq_along(project_git_slicestack)){

  ## Calculate developer turnover between slices and add to smells table
  if(i != 1){

    x1 <- unique(project_git_slicestack[[i - 1]]$author_name_email)
    x2 <- unique(project_git_slicestack[[i]]$author_name_email)

    neldy <- length(setdiff(x1, x2))
    neby <- length(x1)
    neey <- length(x2)

    code_dev_turnover[[i]] <- signif(neldy / ((neby + neey) / 2), digits = 3)

    smells[[i]]$code_dev_turnover <- code_dev_turnover[[i]]

  } else { ## Don't calculate for the first window; store NA in the smells table

    code_dev_turnover[[i]] <- NA

    smells[[i]]$code_dev_turnover <- code_dev_turnover[[i]]

  }
}
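As a standalone sanity check, the formula the loop uses can be exercised on toy data. The helper name and author vectors below are hypothetical, not part of Kaiaulu:

```r
## Standalone version of the turnover formula used in the loop above:
## developers who left, divided by the mean developer count of the
## two slices (Magnoni 2016). Names below are toy data.
slice_turnover <- function(prev_authors, curr_authors) {
  x1 <- unique(prev_authors)
  x2 <- unique(curr_authors)
  leavers <- length(setdiff(x1, x2))
  signif(leavers / ((length(x1) + length(x2)) / 2), digits = 3)
}

## Toy example: alice leaves, dave joins.
slice_turnover(c("alice", "bob", "carol"), c("bob", "carol", "dave"))
## 1 leaver / mean size of 3 = 0.333
```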

Kind regards,

Corné Broere

carlosparadis commented 3 years ago

Thank you for taking the initiative on this! Yes, starting with an issue is the way to go to sort everything else out after 👍

I've been procrastinating on thinking about how to approach this without it turning into a 5000-line loop, as it was originally implemented once we had everything in. So far these loops, as you already saw, have lived in R Notebooks (the Social Smell Notebook, for example, has a few of them), copied and pasted a couple of times rather than written once, for the sake of readability. Sorting this out would also give us a reasonable way to implement black_cloud, which relies on consecutive time window slices.

I'm thinking the best way we can approach this is using the lapply function, and passing to it the slices (as you did) and a list of the functions we wish to apply to each slice as a parameter (and optionally the previous slice). I believe data.table will automatically use its lapply C implementation if we do so, which should also speed up the code considerably for long iterations.

We should also want to have some flexibility on what we use to define the slices in the first place. Right now I think it is based on the git log, but could also be the mailing list time range, or issue tracker, etc.

Passing consecutive slices, however, via lapply may be a bit tricky. Do you want to try to give it a shot?
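One possible shape for this is to lapply over indices rather than over the slices themselves, so each call sees both the previous and the current slice. Everything below is a hypothetical sketch, not an existing Kaiaulu function:

```r
## Sketch: apply a two-slice metric over consecutive pairs of slices
## by lapply-ing over indices 2..n rather than the slices themselves.
## pairwise_metric() is a hypothetical helper, not a Kaiaulu function.
pairwise_metric <- function(slices, metric) {
  c(list(NA),  ## the first window has no predecessor
    lapply(seq_along(slices)[-1], function(i) {
      metric(slices[[i - 1]], slices[[i]])
    }))
}

## Toy slices: each is just a vector of author names.
slices <- list(c("a", "b"), c("b", "c"), c("c"))
leavers <- function(prev, curr) length(setdiff(prev, curr))
pairwise_metric(slices, leavers)  ## NA, 1, 1
```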

p.s.: what do neldy, neby, and neey mean?

CorneJB commented 3 years ago

Thanks for the response! I'm glad I chose the right channel for a message like this. The way I did it now was just storing both the git and the mbox slices during the for loop for the smells.

One of the challenges is that identity_match() happens on the slice level and this results in non-congruent identity_ids across slices. Using the mail addresses and names in the slices introduces the noise that identity_match() normally solves. This could maybe be solved by applying identity_match() outside of the loop on the whole gitlog (probably incurring a performance hit) or using the start/end commit to define where it is applied.

lapply sounds like a great fit for something like this. I would be happy to give it a try. In terms of pseudocode have I understood it correctly if it's something like this:

for(i in window){
...loop that generates slicestack and calculates smells...
}

quality1 <- lapply(slicestack, quality_function1)
quality2 <- lapply(slicestack, quality_function2)
etc.

The neldy, neby, and neey are (maybe) acronyms from Magnoni's (2016, p. 106) dissertation. I must admit that I have not been able to decipher what they mean, but I thought it would aid people cross-referencing the code with the dissertation.

Kind regards,

Corne

carlosparadis commented 3 years ago

One of the challenges is that identity_match() happens on the slice level and this results in non-congruent identity_ids across slices. Using the mail addresses and names in the slices introduces the noise that identity_match() normally solves. This could maybe be solved by applying identity_match() outside of the loop on the whole gitlog (probably incurring a performance hit) or using the start/end commit to define where it is applied.

You are right, this is actually done the way you suggested in the newer notebooks. Thanks for catching this! That was an oversight on my part due to legacy code. Since this is a separate issue, I created a new one to address it. Could you take a look at #97 to see if that would work better for the turnover? :)

As for the lapply:

for(i in window){
...loop that generates slicestack and calculates smells...
}

Ideally, we should actually remove the smells from this loop too. To be more precise, right now some of the quality metrics, such as num.tz, number of developers, etc are also in this loop. They should be outside, each as a function, since each takes as input a git log slice, a mailing list slice, or both. To be even more precise, let's look at the code in question:

https://github.com/sailuh/kaiaulu/blob/2486a1a5d775e1338de20980266c67b52df23db3/vignettes/social_smells_showcase.Rmd#L173-L215

Right now, we have a list of time intervals we loop over, and at each step both the git log and the mailing list get subset to generate the slices. This should likely be turned into a function which outputs the slices as a list. This function's input should likely be similar to identity_match()'s: you take as input a list of tables that are guaranteed to have certain columns defined. In the case of identity match, those are the name and e-mail of users. That way, in the future, if we want to add, say, an issue tracker as a replacement for or in addition to the mailing list, the function can accommodate it as an additional table in the passed list.

The same problem is true here: Right now to compute social smells we may either need a gitlog, a mailing list, or both. In the future, we may also want to use an issue tracker as a source of communication, much like the identity match. So again, passing a list of tables is likely the way to go. I hope you see the analogy in both concept and code here.
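A rough sketch of what such a list-of-tables slicing function could look like follows. Every name here is a placeholder, not an existing Kaiaulu API, and each table is assumed to carry a `date` column:

```r
## Sketch: subset a named list of tables (git log, mailing list,
## issue tracker, ...) over shared time windows. All names are
## placeholders; each table is assumed to carry a `date` column.
slice_tables <- function(tables, window_starts, window_ends) {
  lapply(seq_along(window_starts), function(i) {
    lapply(tables, function(tbl) {
      tbl[tbl$date >= window_starts[i] & tbl$date < window_ends[i], ]
    })
  })
}

## Toy usage with data frames standing in for data.tables.
git <- data.frame(date = c(1, 5, 9), author = c("a", "b", "c"))
mail <- data.frame(date = c(2, 6), sender = c("a", "b"))
slices <- slice_tables(list(git = git, mail = mail),
                       window_starts = c(0, 4),
                       window_ends = c(4, 8))
## Each element of `slices` holds one git and one mail subset.
```

Adding an issue tracker later would then just mean adding one more named table to the list, without touching the slicing logic.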

Now, unlike identity match, whose sole purpose is to perform identity matching, here we will want to calculate multiple metrics over the tables passed in.

Concerning this:

quality1 <- lapply(slicestack, quality_function1)
quality2 <- lapply(slicestack, quality_function2)

we will likely need something like this instead:

quality_metrics <- lapply(slicestack,some_function,smell_and_quality_functions_stack)

Otherwise, we will have to loop over the slices times the number of metrics used. Since this number is large, and since projects like OpenSSL can be as long as 20 years, this can nuke the performance. So we still want to do as much as possible in fewer loops, but not end up with a 3000-line loop, which is hard to maintain.
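A minimal sketch of that single-pass idea, with hypothetical names (base lapply shown; a data.table-backed version would follow the same shape):

```r
## Sketch: visit each slice once and evaluate a whole named list of
## metric functions on it, instead of one lapply pass per metric.
## Every name here is hypothetical.
apply_metric_stack <- function(slicestack, metric_stack) {
  lapply(slicestack, function(slice) {
    lapply(metric_stack, function(metric) metric(slice))
  })
}

## Toy slices: vectors of commit author names.
slicestack <- list(c("a", "b", "b"), c("c"))
metric_stack <- list(
  n_authors = function(s) length(unique(s)),  ## distinct developers
  n_commits = function(s) length(s)           ## raw commit count
)
result <- apply_metric_stack(slicestack, metric_stack)
## result[[1]]$n_authors is 2; result[[1]]$n_commits is 3.
```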

Note this is where it requires more thought about how to implement the turnover, since you need both the "current slice" as well as the "past slice". This is trivial in a for loop, since you can just use i and i - 1, but it is not that obvious in an lapply. And this is exactly why I've been slacking on coming up with a better way to modularize this.

Whatever is chosen, this code should also be refactored into functions that can be used:

https://github.com/sailuh/kaiaulu/blob/2486a1a5d775e1338de20980266c67b52df23db3/vignettes/social_smells_showcase.Rmd#L266-L281

There is one more small thing I should point out that may impact your analysis. Consider that each slice of 3 months (or whatever time window you choose) subsets the git log and mailing list separately. It is unlikely on a large project, but it is possible that one of 4 cases occurs:

In the code I wrote, which is similar to the one you are thinking of now, the slices were smaller, since I looked at them "per project issue", and these 4 case types appeared much more often. This in turn can prevent a smell from being computed for certain slices, if it requires the gitlog, mailing list, or both. Whatever function we come up with to iterate and compute these has to account for that.

Were you planning on adding other quality metrics or the black cloud social smell for your thesis? If so, I can put more time into this so we can agree faster on some templates, to make them easier for you to use! It will probably be easier for us to iterate on this once I get the code that handles these 4 cases into the repo!

CorneJB commented 3 years ago

Thank you for the extensive write-up on this. I think I could definitely give the lapply() implementation of the smells a swing. There are apparently some other flavors, like mapply() and vapply(), that might be able to provide some kind of solution.

For my thesis I am going to use only the Organizational Silo, Radio Silence, and Missing Link, so there is no pressure on my part on implementing black cloud. I'm really enjoying contributing though and want to give back, so if I can prove useful in any way to implement more smells I would happily do so.

It is very funny that you mention those 4 cases, because one of them caused my first few hours of struggle to get kaiaulu working. The first range I tried to analyze had no mailing list data, which results in a rather non-descriptive "x is not a bipartite graph, supply 'types' argument" error. Putting N/A in the table where applicable, and writing some documentation that explains the probable cause, would already clear it up a lot I think.

carlosparadis commented 3 years ago

Yup! You're right. Once I push the other code up, I will transfer the 4 conditional cases over to this loop too. The other variants of lapply are OK, although we want to make sure they are also implemented in C via the data.table package. Please do ask me if you get stuck on anything in the code :^) I could have saved you some time!

Since this seems it will take more time, and there are already the 4 cases to be considered, let's just stick with the for loops for now!

How about this: Could you do a PR for the turnover loop code as an additional code block after this line: https://github.com/sailuh/kaiaulu/blob/c5487692493c7485b8a8fe94b6c916332a8286fe/vignettes/social_smells_showcase.Rmd#L298, and add a header # Consecutive Time Slices Metrics?

Also, ok on holding on black cloud! Were you planning on adding any other quality metrics for your thesis?

CorneJB commented 3 years ago

Thanks! The for loops seem to perform pretty well on my laptop, even on big projects like nginx or openssl. I will try to create a PR as soon as possible.

For my thesis, I'm trying to delve deeper into the turnover metrics and community smells (I kind of moved away from licenses, since they are so hard to operationalize and, with so many of them, to gather enough data on). I made something in my private forks that creates a matrix of which dev was implicated in which smell, and whether the dev turned over in the window after they were implicated, so it calculates something like this:

| identity_id | dev | in_org_silo | in_radio_silence | turnedover |
|-------------|-----|-------------|------------------|------------|
| 1 | A | 1 | 0 | 0 |
| 2 | B | 1 | 1 | 1 |
| 3 | C | 0 | 1 | 1 |

It still needs to be updated with the new identity_match() implementation, which will make it more accurate and hopefully allow me to do some probability analysis on the community smells in relation to turning over. I'm still cracking the books on which control variables (probably other quality metrics) might be needed to make this as theoretically sound as possible.

Kind regards,

Corné Broere

carlosparadis commented 3 years ago

This looks very interesting! It is true the social smells functions return more than just counts, so it is nice to see that being capitalized on :).

Some food for thought:

a) You could technically have four representations of that matrix, (bipartite, temporal) x (file, entity), since they influence the social smells. Which setup makes the most sense in your case?

b) Do you plan to also differentiate the turnover metrics as Simone did on p.106? i.e.

  1. turnover of global members (global.turnover);
  2. turnover of collaboration members (code.turnover);
  3. turnover of global core members (core.global.turnover);
  4. turnover of communication core members (core.mail.turnover);
  5. turnover of collaboration core members (core.code.turnover).

c) The paper "The Canary in the Coal Mine... A cautionary tale from the decline of SourceForge" (https://onlinelibrary.wiley.com/doi/epdf/10.1002/spe.2874) focuses on the Smelly-Quitters quality framework metric (defined in Section 3.2.1). Figures 7 and 8 may give you some interesting ideas!

d) @maelstromdat (Damian) recently versioned the original code from Simone's dissertation: https://github.com/sailuh/kaiaulu/blob/88-add-social-smells/sociotechnical.R. Odds are that you can find the code you want there, so you don't need to reimplement everything (the social smells you are using were ported from this file). If you do port code over, try not to rely on igraph (that is a huge dependency I'd prefer not to require in the Kaiaulu API). I still need to remove said dependency from the social smells.

Way down the road, if you wish to push your final thesis analysis as notebooks, you would be more than welcome to :) It would be a good test to see if the current project configuration files suffice to represent everything needed, as they were originally intended.

Thanks!

CorneJB commented 3 years ago

Hi Carlos,

Thanks for thinking with me here. I've put each of the answers/counterquestions below.

a.) I am still not quite sure what the difference is between the temporal and the bipartite interpretation over time. As I understand it now, the temporal interpretation is at a discrete point in time, but since we are analyzing a window, wouldn't a bipartite projection also represent a discrete point in time? I think entity is the way to go, since I am most interested in the behavior of the contributors in relation to their smells, and less so in the specific contributions they make or the characteristics of those contributions.

b.) Yes! I have actually added a version of that now, and calculate the true turnover based on some logic that checks whether they turned over from the mailing list, git, or both. Before, I calculated it in a way that meant a developer who never contributed to the code or mailing list would automatically be marked as turned over. But it seems hard to quit something you have never done :thinking:. My matrix currently looks like this; I should probably factor in the correct variable names, but the pressure of write-up time is bearing down on me.

| identity_id | dev | in_org_silo | in_radio_silence | code_turnedover | ml_turnedover | nocode | notalk | true_turnover | ml_activity | git_activity |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Remy | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 109 |
| 3 | Mark | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 252 | 617 |
| 6 | Coty | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 26 | 8 |
| 7 | Keiichi | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 43 |
| 8 | Jean | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
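The "hard to quit what you never did" rule above could be sketched roughly as follows. Column names mirror the matrix, but the function and its logic are illustrative only, not the actual thesis code:

```r
## Sketch: a developer only counts as "truly" turned over in channels
## where they were actually active. Column names mirror the matrix
## above, but this logic is illustrative, not the thesis code.
true_turnover <- function(code_gone, ml_gone, git_activity, ml_activity) {
  code_quit <- code_gone & git_activity > 0  ## left code they worked on
  ml_quit   <- ml_gone & ml_activity > 0     ## left a list they posted to
  as.integer(code_quit | ml_quit)
}

## Toy data: the first dev never committed, so disappearing from git
## alone does not count as quitting.
true_turnover(code_gone    = c(TRUE, TRUE),
              ml_gone      = c(FALSE, FALSE),
              git_activity = c(0, 10),
              ml_activity  = c(3, 0))  ## 0 1
```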

c.) Thanks! This paper provides some really nice theoretical pointers and interesting ways of visualizing what is happening in projects. I'm zooming in on the Smelly-Quitters by seeing what specific smells they have been implicated in, and hopefully finding some interesting results :sweat_smile:

d.) I used a lot of what is in there as inspiration for my implementation of the metrics I am looking at. But due to the dependency on igraph and my limited knowledge of that package and graph theory I tried to keep it a bit more simple. That way I have a better understanding of what I am actually calculating.

The notebook will definitely be shared when it's finished (I expect some serious polishing to be necessary after that), although I am ashamed to admit that my personal part has spiraled into the 300-lines-of-for-loops territory.

Thanks again and kind regards,

Corné

carlosparadis commented 3 years ago

a)

a.) I am still not quite sure what the difference is between the temporal vs the bipartite interpretation over time? As I understand it now the temporal interpretation is at a discrete point in time, but since we are analyzing a window, wouldn't a bipartite projection also represent a discrete point in time? I think entity is the way to go, since I am most interested in the behavior of the contributors in relation to their smells and less so the specific contributions they make or characteristics of those contributions.

Here's an image from my dissertation, which I borrowed and adapted from Mitchel's dissertation. Please let me know if this makes more sense :)

Here's an example of how temporal and projection differ. You can see from here that projections will be more prone to creating more edges. Note the temporal method generates directed edges, and the direction of the edges is based on the order of t_i, t_i+1, t_i+2 in the image above, whereas the projection generates an undirected graph (represented by the arrows going both ways).

[Figure: Joblin, Fig. 2.12 — temporal vs. projection]

As for file vs. entity, here's an example comparing both. Note files will lead to more edges than functions:

[Figure: Joblin, Fig. 2.12 — file vs. entity granularity]

Note in this figure, what you define as entity (e.g. classes, functions, etc) will also influence your metrics. In a way, file itself is an entity. Modules or the entire project would be coarser ones.

I am just reminding you of this now that you are more familiar with the pipeline, since it implicitly influences what smells you will get and the metrics you compute :) So, more food for thought for your work!

b)

I should probably factor in the correct variable names but the pressure of write-up time is bearing down on me.

Hahaha! No worries 😄 There is always hope so long as you can understand the code after you are done writing! And I am glad you considered the various cases!

c)

c.) Thanks! This paper provides some really nice theoretical pointers and interesting ways of visualizing what is happening in projects. I'm zooming in on the Smelly-Quitters by seeing what specific smells they have been implicated in, and hopefully finding some interesting results 😅

Glad it was of use! :)

d)

d.) I used a lot of what is in there as inspiration for my implementation of the metrics I am looking at. But due to the dependency on igraph and my limited knowledge of that package and graph theory I tried to keep it a bit more simple. That way I have a better understanding of what I am actually calculating.

I understand! Let me know if there is anything in particular you are curious about in the code or the dissertation from the file I pointed to that may interest you down the road; I can try cutting out a few pieces of the code, or some abstraction or example, that may save you time.

The notebook will definitely be shared when it's finished (I expect some serious polishing to be necessary after that), although I am ashamed to admit that my personal part has spiraled into the 300-lines-of-for-loops territory.

Much appreciated! No worries, so long as the code can be understood, performance can always be worked out later. Good luck, and thanks for the contributions! 🚀

CorneJB commented 3 years ago

Thanks for the thorough explanation. It turns out I had fully misunderstood what several of those concepts meant :sweat_smile:. I really like working on the project, and after the write-up I am committed to integrating the functionality (if it turns out to be useful haha). I appreciate your guidance and consideration deeply.

Regards,

Corné