sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0
18 stars 12 forks source link

Temporal Networks do not account for weights #229

Closed carlosparadis closed 7 months ago

carlosparadis commented 1 year ago

Nicole picked up the temporal projection in Kaiaulu does not account for weights, whereas Codeface does. This likely heavily impact the difference in output.

Kaiaulu: https://github.com/sailuh/kaiaulu/blob/master/R/network.R#L161-L194

Codeface: https://github.com/lfd/codeface/blob/nicole-updates/codeface/cluster/cluster.py#L500-L501

carlosparadis commented 10 months ago

Here are some more details on Kaiaulu side I remembered while going through the network.R and graph.R.

When we start off the raw data, some data sources are naturally a graph by definition. For example, the dependencies that Depends output are a node and edgelist. On the other hand, the git log itself, where in tabular form every row is a commit that changes one file, is a edgelist of a graph, and the node table has to be derived of it.

In Kaiaulu, both cases are handled by first calling a transform_* function from network.R, which basically is how the user specify what information should be used to make a unimodal or bipartite graph, and then the graph function is called to construct it.

There is a caveat in the graph construction, however, depending on what we are trying to do with the graph, which I didn't realize before. At its most basic form, a graph edgelist table can be represented in two ways. Suppose the following Git Log:

After calling the graph function to define nodes and edgelist for the bipartite graph Dev - File, the edgelist would contain:

The Case B leads to a graph whose tables have duplicated edges, which some graph libraries will not like it. Case A will address that, and pass the duplication as weight so the information is not lost.

When computing a bipartite projection, the Case A works just fine. The graph weight should be defined at that point, and then the bipartite function in graph.R requests the user to choose how to combine the weights.

The story is different for a temporal transformation instead. Here, we can't combine duplicated edges (Case A) apriori, but rather have to preserve them (Case B), including additional edge metadata needed for said temporal edge (i.e. the timestamp of the edge). We have to preserve the duplicated edges (along with timestamp), because we have to derive in the resulting graph using them.

Therefore, the model_direct_graph and associated network functions should have a more general model to represent graphs, and indicate if edges are duplicated or not.

carlosparadis commented 10 months ago

One additional note on temporal: It seems Kaiaulu implementation differs from the original definition it cites. Namely, given a file or entity, Kaiaulu temporal will only consider the current and immediately before commit when passing the weights to the new temporal edge.

I believe in the original formulation of Codeface, if dev A changes a function in the given 3 month time window with churn 10, dev B with 3 and dev C with 5, then

dev A <- dev B (10 + 3) dev B <- dev C (3 + 5) dev A <- dev C (10 + 5) ---- Kaiaulu will not generate this edge

In other words, I believe in Codeface implementation, every developer within the 3 month time window that changes code after another, will have an edge to every developer that preceded it, with the weights of all the contributions the preceding developer did to the function up to that point.

Kaiaulu implementation only assumes collaboration and the churn (or any other metric chosen as weight to the graph) of the previous developer.

Another distinction from my understanding is that Kaiaulu will generate auto loop edges if a developer commits in succession to his/her code. Codeface will not include those edges.

This interpretation is based of Figure 3.2 example on Mitchell's dissertation.

I am not sure which interpretation is better. On Codeface case, we assume the current developer has to consider the changes of all developers that came before, up to the time window limit. On Kaiaulu, just the one before (even if that is oneself). I imagine at some point the code logic of more recent changes may overwrite entirely the code logic of further way changes, which lead to a decay weight parameter. :^) Well, I will leave Kaiaulu as is for now, and focus on the code interface with the rest of the codebase and add a note to the function description.

carlosparadis commented 10 months ago

Accidentally forgot to note the issue ID on the branch, so it did not show above, but here's the reference: https://github.com/sailuh/kaiaulu/commit/fe36737bb681b5b87d218f80fabea39a9107d100

This commit adds the original formulation of Mitchell's dissertation and Codeface. Since the original formulation only implements one of the two and calls it temporal projection, I will refer to the original as the all_time_lag_temporal_projection whereas the current implementation will be eventually renamed to one_time_lag_temporal_projection (pending editing on the branch code).

The difference, as the name suggests and as noted on the comment above, is whether on a situation of Dev A, Dev B, Dev C, commits in this order, if we establish the connection Dev C -> Dev A (a 2-time lag dependency), or not.

all_time_lag_temporal_projection -- simpler case

I have no idea how Codeface actually implements this, and based off my understanding from Mitchell's dissertation instead. I will reproduce it here, but if it becomes an issue I will just make a doodle out of it:

Screen Shot 2023-11-12 at 7 24 45 AM

This is reflected in the following unit test in this Kaiaulu branch:

test_that("all time lag temporal projection matches original formulation", {

  timestamps <-  as.POSIXct(c("Tue Aug 17 15:59:33 1999 +0000","Tue Aug 17 16:59:33 1999 +0000",
                               "Tue Aug 17 17:59:33 1999 +0000","Tue Aug 17 18:59:33 1999 +0000"),
                             format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")

  project_git <- data.table(file_pathname = c("file_a","file_a","file_a","file_a"),
                             author_name_email = c("dev 1","dev 2","dev 2","dev 3"),
                             committer_name_email = c("dev 1","dev 2","dev 2","dev 3"),
                             author_datetimetz = timestamps,
                             committer_datetimetz = timestamps,
                             lines_added = c(4,2,3,2),
                             lines_removed = c(0,0,0,0))

  #git_graph <- transform_gitlog_to_bipartite_network(project_git,"author-file")
  git_graph <- copy(project_git)
  setnames(git_graph,
           old = c("author_name_email",
                   "file_pathname"),
           new = c("from",
                   "to"))
  git_graph <- model_directed_graph(git_graph,is_bipartite = TRUE, color = c("black","#f4dbb5"), aggregate_duplicate = FALSE)
  git_graph[["edgelist"]]$weight <- git_graph[["edgelist"]]$lines_added + git_graph[["edgelist"]]$lines_removed

  temporal_projection <- temporal_window_graph_projection(git_graph,mode=TRUE,timestamp_column ="author_datetimetz",
                                                          weight_scheme_function = weight_scheme_cum_temporal)

  expect_equal(temporal_projection[["edgelist"]][from == "dev 2" & to == "dev 1"]$weight, 4+2+3)
  expect_equal(temporal_projection[["edgelist"]][from == "dev 3" & to == "dev 1"]$weight, 4+2)
  expect_equal(temporal_projection[["edgelist"]][from == "dev 3" & to == "dev 2"]$weight, 2+3+2)
  expect_equal(temporal_projection[["edgelist"]][from == "dev 2" & to == "dev 2"]$weight, 2+3)
})

The project_git simulates the 4 commits of the upper portion of the image, whereas the the expect equal matches the lower portion of the image on the collaboration between the developers and the direction of the edges. They fully match.

Note that Kaiaulu will also provide one additional piece of information the original formulation doesn't (at least not on the example), which is the auto-loops (last equal statement). If a developer makes a commit after oneself, this will still be reported as an auto-loop. If the information is not of relevance, a simple subset of the final table where from != to results in the table of the original formulation.

all_time_lag_temporal_projection -- alternating developers

In the dissertation, Mitchell notes that the defining equation of this transformation considers directionality of the edges, and as such w_d1_d2(f) != w_d2_d1. However, the example does not convey this case, as every developer never "comes back" to commit to the same function again.

I made the following doodle to represent one such case:

Screen Shot 2023-11-12 at 7 38 16 AM

This is reflected in the following unit test:

test_that("all time lag temporal projection matches original formulation with alternating authors", {

  timestamps <-  as.POSIXct(c("Tue Aug 17 15:59:33 1999 +0000","Tue Aug 17 16:59:33 1999 +0000",
                              "Tue Aug 17 17:59:33 1999 +0000","Tue Aug 17 18:59:33 1999 +0000"),
                            format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")

  project_git <- data.table(file_pathname = c("file_a","file_a","file_a","file_a"),
                            author_name_email = c("dev 1","dev 2","dev 1","dev 2"),
                            committer_name_email = c("dev 1","dev 2","dev 1","dev 2"),
                            author_datetimetz = timestamps,
                            committer_datetimetz = timestamps,
                            lines_added = c(1,3,5,7),
                            lines_removed = c(0,0,0,0))

  #git_graph <- transform_gitlog_to_bipartite_network(project_git,"author-file")
  git_graph <- copy(project_git)
  setnames(git_graph,
           old = c("author_name_email",
                   "file_pathname"),
           new = c("from",
                   "to"))
  git_graph <- model_directed_graph(git_graph,is_bipartite = TRUE, color = c("black","#f4dbb5"), aggregate_duplicate = FALSE)
  git_graph[["edgelist"]]$weight <- git_graph[["edgelist"]]$lines_added + git_graph[["edgelist"]]$lines_removed

  temporal_projection <- temporal_window_graph_projection(git_graph,mode=TRUE,timestamp_column ="author_datetimetz",
                                                          weight_scheme_function = weight_scheme_cum_temporal)

  expect_equal(temporal_projection[["edgelist"]][from == "dev 2" & to == "dev 1"]$weight, 1+3+5+7)
  expect_equal(temporal_projection[["edgelist"]][from == "dev 1" & to == "dev 2"]$weight, 3+5)
  expect_equal(temporal_projection[["edgelist"]][from == "dev 1" & to == "dev 1"]$weight, 1+5)
  expect_equal(temporal_projection[["edgelist"]][from == "dev 2" & to == "dev 2"]$weight, 3+7)
})

My interpretation of the alternating case is that Dev 2 contributes with Dev 1 across all weights available, as the last commit pertains to them. So, by definition of all_time_lags every code contribution to the same file has to be accounted for.

In contrast, when we ask how much Dev 1 contributed with Dev 2, two edges are not included: The edge from commit 1, and the edge of commit 4. The edge of commit 1 is not included because developer 2 never changed that file at that point in time (note the weight 1 will be included on the auto-loop of Dev 1, however!). The edge of commit 4 does not get included because it happened after all contributions of Dev 1. Thus, this respects the property of the original formulation.

Which formulation makes more sense?

No idea! Might as well leave both available. Ultimately, in Kaiaulu implementation, the temporal formulation is not hardcoded to git log! You can apply to literally any graph, even one not at all related to Software Engineering, so some datasets may benefit from one over the other.

For instance, if we have reason to believe the change of the next person entirely re-writes the previous file, then I'd assume the one_lag to makes more sense: Subsequent people will not really be building of "past people work" if everyone keeps erasing everything. No trace will be left, and no effort on understanding anyone past the first time lag will occur.

In addition, Kaiaulu formulation for the all_time_lag is still parameterized by a weight function. This means one could specify the weight function = NULL, study the table of edges, and even apply a decaying factor to the weight of collaboration as a function of time.

More details on this on the next message.

carlosparadis commented 10 months ago

Kaiaulu now has 3 functions for projections, and they have some overlap. It may in fact be possible to combine all 3 projection functions into one, and parameterize not only the weight function, but also the formation of edges as a parameter function. I will try the temporal first after testing the all_lag a bit more.

nicolehoess commented 10 months ago

Codeface currently implements the following weight scheme for temporal networks: weight_scheme_codeface