se-sic / coronet

coronet – the R library for configurable and reproducible construction of developer networks
GNU General Public License v2.0
7 stars 15 forks source link

Improve simplification of networks #120

Open clhunsen opened 6 years ago

clhunsen commented 6 years ago

Description

Currently, the code in the function simplify.network() is quite complex: To preserve an edge for each existing relation, we need to obtain the complete edge list, split it, construct networks from the separate different edge lists again, simplify these, and merge the separate networks in the end (by obtaining their edge lists...).

We need to find a way to apply the EDGE.ATTR.HANDLING on the raw edge lists (i.e., data.frames). Maybe, using the existing list when constructing an appropriate sqldfstatement is a worth a try. Also, trying to use "unique" as a value in the EDGE.ATTR.HANDLING list may solve the issue right away. :wink:

https://github.com/se-passau/codeface-extraction-r/blob/2241817620505a396846da1d2afa0264afafb55a/util-networks.R#L1192-L1233

https://github.com/se-passau/codeface-extraction-r/blob/2241817620505a396846da1d2afa0264afafb55a/util-networks.R#L44-L63

Versions

This affects the upcoming version v3.2 as it contains the code of PR #115.

MaLoefUDS commented 2 weeks ago

I've come to look at this issue. After inspection, I find that the described issue is indeed present, i.e., we do indeed need to perform the described steps to simplify a network properly. Im unsure what the performance impact of this convolution is though.

Towards the proposed solutions: I don't think we can use "unique" in the EDGE.ATTR.HANDLING, for the reason that I am just unsure what the expected outcome would be. Same goes for the sqldf suggestion. The underlying issue is that the igraph::simplify function (that we use) has no notion of grouping over certain attributes. Therefore, we have to pre-group ourselves. In what way we perform pre-grouping can only have minor impacts on performance and readability I suppose.

Possible solutions I see are: Reimplementing the igraph::simplify function with a notion of grouping. This approach may not be a fan favorite 😂. Then, we could also go to the drawing board and figure out how to include grouping into custom lambda functions that we then set as values into the EDGE.ATTR.HANDLING. This may work, but I have not yet thought about the specifics.

MaLoefUDS commented 1 week ago

Preamble: In our last meeting, we discussed, that any form of performing simplification on the edge list directly involves us reimplementing igraphs simplification logic, which as enticing it may sound for the performing of our code, is definitely not something we want to do. Nevertheless, we do want to reduce the amount of times, where we have to obtain, edit, set, obtain, edit, set ... the edge lists of the individual networks. A good step into that direction is to replace the call to merge.networks with something implemented by the igraph people, in the hope that they may implemented it in a more performant way.

So I have been searching the docs for a function to merge networks natively. igraph::union does not do what we want, as it creates duplicate columns in the edges for every column that appears multiple times in the input networks, i.e., the unified network would have columns type_1, type_2, type_3 ... kind_1, kind_2, kind_3, etc. which would require us to once again iterate the edge list and merge together these artifacts. However, I found that the igraph::disjoint_union function does what we need. The function assumes that input nets are disjoint and creates an output network in which there are copies of every vertex of all input nets with a single nicely formated edgelist. This means in our case, that if we merge 4 networks and all have these vertices A, B, C, then the disjoint union will have vertices A, B, C, A, B, C, A, B, C, A, B, C but the edge list will appear "correctly". We can then contract all vertices that are effectively equivalent through igraph::contract and then achieve the same output as with merge.networks (all tests ☑️). I am not here to judge for the efficiency of this solution. Under the assumption, that the igraph people did a nice job in the implementation of disjoint_union and contract, then we may achieve a significant benefit, otherwise, Im not sure. Also I don't have the big testing data on me to evaluate performance. Nevertheless, this solution does reduces code duplication and would make merge.networks and merge.network.data mostly obsolete.

My code (util-network.R line 1720ff):

## merge the simplified networks
network = igraph::disjoint_union(networks)
vertex.count = igraph::vcount(networks[[1]])
network = igraph::contract(network,
                           rep(1:vertex.count, length(networks)),
                           vertex.attr.comb = "first")