se-sic / coronet

coronet – the R library for configurable and reproducible construction of developer networks
GNU General Public License v2.0
7 stars 15 forks source link

Fix edge-attribute handling when simpliyfing networks #251

Closed bockthom closed 7 months ago

bockthom commented 9 months ago

In PR #250, we introduce a new possibility for simplifying multi-relation edges. However, there are no tests yet for this new feature, and also the edge-attribute handling still needs to be adjusted accordingly. Hence, in this issue, I describe the missing steps that should be approached after PR #250 is merged:

Let me start with an example: Assume that we have an author network containing two vertices A and B, which has multiple relations, namely "mail" and "cochange". Assume that there are three cochange edges and four mail edges between A and B. Then the default simplification reduced the three cochange edges to one cochange edge, and the four mail edges to one mail edge. The new "simplify.multiple.relations" parameter instead simplifies the three cochange and four mail at edges to one single edges that represents all mail and cochange edges together.

Problem: Edge attributes need to be handled properly for the new case. The edge-attribute handling for simplification is specified here: https://github.com/se-sic/coronet/blob/a0c3a5262a8837679e2c04343e10b2e41dcab4df/util-networks.R#L55-L70

Currently, we only take the "first" relation - which was not a problem, was only edges of the same relation were simplified to one edge. Now, when we simplify edges of different relations to the same edge, "first" is wrong, as we need a unique vector of all relations instead of just taking the first one. Unfortunately, the edge-attribute handling of igraph does not provide a dedicated function for that. However, igraph allows to provide a custom function that is applied when handling edge attributes during simplification. Hence, such a custom function is necessary for relation here.

However, changing the edge-attribute handling will also break some other parts of coronet, which assume that the edge attribute relation are only a single element - but now it also can be a vector. For example, this might break the following lines, in which the data sources are determined from the relations - this might break if there are vectors of relations instead of single elements: https://github.com/se-sic/coronet/blob/a0c3a5262a8837679e2c04343e10b2e41dcab4df/util-networks.R#L1673-L1689

It is important, that the new functionality is backwards compatible, this is, if the default simplification procedure is used, then everything should work as before, and only if there are multiple relations combined into a single edge, we need to find a way that all the functions are also able to deal with a vector of relations instead of a single edge.

In addition, PR #250 also lacks of tests. So proper tests should also be added :wink:

maxloeffler commented 9 months ago

I have implemented a simple disambiguation of the relation attribute here. After executing all tests, I did not find any breaking sections in coronet (as of right now, I did not add any tests myself for this). Especially for the case you mentioned, im not sure why it should break. Manual inspection showed it working just fine 😄

bockthom commented 9 months ago

I have implemented a simple disambiguation of the relation attribute here.

I've never used modifyList, but it seems to be helpful :) However, I am not sure if your implementation works as intended in all cases. One characteristic that I forgot mentioning before was sorting: We need to make sure that c("cochange, "issue") and c("issue","cochange") are treated equal, which is why we should sort the unique vectors afterwards, to be on the safe side.

After executing all tests, I did not find any breaking sections in coronet (as of right now, I did not add any tests myself for this). Especially for the case you mentioned, im not sure why it should break. Manual inspection showed it working just fine 😄

Hm, I am not convinced yet. In the case mentioned above, the following line returns different values in different cases:

data.sources = unique(igraph::E(network)$relation) 

If every edge represents a single relation, then data.sources can be the following list: list("issue", "mail", "cochange") However, if we simplify multi-relation edges, it can be the following list: list(c("cochange", "issue"), c("issue"), c("issue","mail"), c("cochange", "issue", "mail")) In this case, we have four edges with four different unique relations, but this does not resemble the data sources, but a combination of them. I am not sure if the subsequent sapply call (see the cited code snippet above) can deal with these vectors that consist of multiple elements properly. Please check again what happens in this function in such a case.

maxloeffler commented 8 months ago

Oh by the way, I have fixed that one function affected by the changes to the relation attribute. I forgot that pushes to that branch do not notify anyone of you, but please look at this to oversee the fix. I have started to working on tests for the new simplification and will let you know more details in our next meeting 😄

bockthom commented 8 months ago

I had a quick look at the last two commits on the referenced branch. Looks good to me! Regarding the edge attribute handling, an additional idea comes to my mind now: Wouldn't it be possible to move function relation.handling directly to the constant EDGE.ATTR.HANDLING? If this would be possible, then we would make sure that this is handled the same way whenever EDGE.ATTR.HANDLING is used (wherever this will happen in the future), and we would get rid of the modifyList call here. Just an idea, not sure if it works, but it is worth a try :smile:

maxloeffler commented 8 months ago

Im not entirely sure, I understand what you mean. Obviously, we cannot move relation.handling into EDGE.ATTR.HANDLING's definition, as it is not always the way we want to work. So I assume you mean that at the point of simplifying with the simplify.multiple.relations parameter set, then we overwrite the default relation handling function with relation.handling?

If I assumed correctly: I used modifyList especially because it does not alter the EDGE.ATTR.HANDLING object but returns a new one, because I did not want to break things by accident. However, I see your point in that it would provide benefit when overriding it at this point. So I checked it out in code and found that even when overriding it by doing EDGE.ATTR.HANDLING$relation = relation.handling it will not persist. I tried having two consecutive calls to simplify.network in one test and even though overriding works in both calls individually, it does not persist into the next calls, i.e., in the second call to simplify.network, EDGE.ATTR.HANDLING will have returned to its default. I am not sure what R does there and if it can still be done, maybe you know more :).

bockthom commented 8 months ago

Obviously, we cannot move relation.handling into EDGE.ATTR.HANDLING's definition, as it is not always the way we want to work.

Why not? I don't see any situation at which changing the definition of EDGE.ATTR.HANDLING would break. Calling "sort" and "unique" should not change anything in the previously existing way of simplification. But maybe I overlook something that I don't have in mind - do you have a certain situation in mind in which replacing the definition would break?

maxloeffler commented 8 months ago

Well I supposed when simplifying multi-relational networks the classical way, but upon looking at the code, I think it would retain functionality. Apart from that, EDGE.ATTR.HANDLING is only ever accessed in one other location:

https://github.com/se-sic/coronet/blob/a0c3a5262a8837679e2c04343e10b2e41dcab4df/util-networks.R#L910-L915

I am not sure on the effects of changing the relation attribute in this call.

bockthom commented 8 months ago

According to the documentation of igraph::as.undirected, edge.attr.comb should only be effective if mode is "collapse" or "mutual". As "each" is used in this case, edge.attr.comb should not have any effect. Even when we would assume that we would use another mode here, I don't see any reason why the relation in artifact networks should be handled differently than in author networks.