Optimizations for network-based core-peripheral classification

se-sic / coronet

coronet – the R library for configurable and reproducible construction of developer networks

GNU General Public License v2.0

7 stars 15 forks source link

Optimizations for network-based core-peripheral classification #148

Closed bockthom closed 5 years ago

bockthom commented 5 years ago

Description

Today, @SCPhantom, @clhunsen, and I figured out that different classification methods are currently not comparable as they are operating on different data:

commit-count classification includes commits to the empty artifact whereas the network-based classification methods don't. As a consequence, active developers (which contribute only to empty artifact) do not appear in any network and, hence, are not classified, even if they are peripheral.
network-based classification methods classify developers which are part of the current network but not really active (as authors of a commit) in the corresponding time period. (This case occurs when splitting networks. In that case, also in-active developers get part of the network.)

Solution

After discussing several options with @clhunsen, we came to the following decisions:

Add all active developers to the cochange author-network, even when they work only on the empty artifact, but don't construct edges when co-changing the empty artifact. As a consequence, developers only working on the empty artifact should appear in the classification and should be classified as peripheral.

[Hints for the implementation:

a) When constructing author cochange networks, use filtered data and NOT filtered.empty. To achieve this, we need to change the group.data.by.column the function here:

https://github.com/se-passau/codeface-extraction-r/blob/d38807b503158f8431b710da56ea280646c1bb5d/util-data.R#L1011-L1012

b) Do not construct edges among the empty artifact within the following function:

https://github.com/se-passau/codeface-extraction-r/blob/d38807b503158f8431b710da56ea280646c1bb5d/util-networks.R#L861-L862

]

Add a parameter restrict.classification.to.vertices to all network-based classification functions to specify the authors for which the class should be determined. For example, when passing a list of active developers (based on commit data) to this parameter, then only those developers should be classified as core resp. peripheral. All the other developers are still part of the network but are omitted from the classification. So, in the end, it is the user's decision, which developers to classify and which not. Hence, the user has to care about which cases matter in the network of interest.

[Hints for the implementation:

Add the described parameter and its functionality to the following functions:

get.author.class.network.degree
get.author.class.network.hierarchy
get.author.class.network.eigen

Question: Do we also need this parameter for non-network-based classification functions?

Afterwards, add the ... parameter to all functions that call the above stated functions internally and recursively go on by the functions which call those functions with the ... again... ]

Final remarks

I will probably work on the implementation of the decided solution as soon as I will have time for that (or assign that issue to anyone else otherwise).

If anyone has additional comments regarding the provided solution, just feel free to comment this issue :wink:

After the implementation is finished, everyone who uses network splitting (also known as edge-based splitting, or in combination with other splitting algorithms: hybrid splitting) should check whether (s)he should update the corresponding function calls.

clhunsen commented 5 years ago

Add all active developers to the cochange author-network, even when they work only on the empty artifact, but don't construct edges when co-changing the empty artifact. As a consequence, developers only working on the empty artifact should appear in the classification and should be classified as peripheral.

[Hints for the implementation:

a) When constructing author cochange networks, use filtered data and NOT filtered.empty. To achieve this, we need to change the group.data.by.column the function here:

https://github.com/se-passau/codeface-extraction-r/blob/d38807b503158f8431b710da56ea280646c1bb5d/util-data.R#L1011-L1012

b) Do not construct edges among the empty artifact within the following function:

https://github.com/se-passau/codeface-extraction-r/blob/d38807b503158f8431b710da56ea280646c1bb5d/util-networks.R#L861-L862

I gave this some thought while reading: Alternatively, we could just override the set of vertices in the following code with a union of the generated one (author.net.data[["vertices"]]) and a set that we retrieve from the commits.filtered data. Maybe, this idea leads to less invasive code changes in the end.

https://github.com/se-passau/codeface-extraction-r/blob/d38807b503158f8431b710da56ea280646c1bb5d/util-networks.R#L165-L170

bockthom commented 5 years ago

Add all active developers to the cochange author-network, even when they work only on the empty artifact, but don't construct edges when co-changing the empty artifact. As a consequence, developers only working on the empty artifact should appear in the classification and should be classified as peripheral.

I gave this some thought while reading: Alternatively, we could just override the set of vertices in the following code with a union of the generated one (author.net.data[["vertices"]]) and a set that we retrieve from the commits.filtered data. Maybe, this idea leads to less invasive code changes in the end. https://github.com/se-passau/codeface-extraction-r/blob/d38807b503158f8431b710da56ea280646c1bb5d/util-networks.R#L165-L170

Yeah, that sounds like a good idea! I really like this idea, I prefer this one over the one we had yesterday. The only question is, how to retrieve the vertices from the commits.filtered data. Something like that:

unique(project.data$get.commits.filtered()[["author.name"]])

Remark: Independently from the changes we discuss here, the TODO here is still there... https://github.com/se-passau/codeface-extraction-r/blob/d38807b503158f8431b710da56ea280646c1bb5d/util-data.R#L1011-L1012

bockthom commented 5 years ago

One additional remark on part 2:

Add a parameter restrict.classification.to.vertices to all network-based classification functions to specify the authors for which the class should be determined. For example, when passing a list of active developers (based on commit data) to this parameter, then only those developers should be classified as core resp. peripheral. All the other developers are still part of the network but are omitted from the classification. So, in the end, it is the user's decision, which developers to classify and which not. Hence, the user has to care about which cases matter in the network of interest.

The user has to keep outliers to have them in the network to analyze (for example, when splitting networks, remove.isolates has to be FALSE. I have to think about that whether this does have any other side effects or not...)

[Hints for the implementation:

Add the described parameter and its functionality to the following functions:

get.author.class.network.degree

get.author.class.network.hierarchy

get.author.class.network.eigen

Question: Do we also need this parameter for non-network-based classification functions?

Afterwards, add the ... parameter to all functions that call the above stated functions internally and recursively go on by the functions which call those functions with the ... again... ]

clhunsen commented 5 years ago

Yeah, that sounds like a good idea! I really like this idea, I prefer this one over the one we had yesterday. The only question is, how to retrieve the vertices from the commits.filtered data. Something like that:
unique(project.data$get.commits.filtered()[["author.name"]])

Yeah, this looks good. Maybe, we should construct a failsafe check whether the two sets of authors are in a subset relation. Probably, it is sufficient to remove the empty artifact from the list that we obtain here:

https://github.com/se-passau/codeface-extraction-r/blob/d38807b503158f8431b710da56ea280646c1bb5d/util-networks.R#L158

Remark: Independently from the changes we discuss here, the TODO here is still there...

https://github.com/se-passau/codeface-extraction-r/blob/d38807b503158f8431b710da56ea280646c1bb5d/util-data.R#L1011-L1012

Nice finding. If we address this TODO item as suggested, we definitely need to patch the construction of the co-change author network as suggested above to omit the empty artifact.

jkronaw commented 5 years ago

Sooo, coming now to the second main point:

Add a parameter restrict.classification.to.vertices to all network-based classification functions to specify the authors for which the class should be determined. For example, when passing a list of active developers (based on commit data) to this parameter, then only those developers should be classified as core resp. peripheral. All the other developers are still part of the network but are omitted from the classification. So, in the end, it is the user's decision, which developers to classify and which not. Hence, the user has to care about which cases matter in the network of interest.

Just to make sure, but I think your statement, @bockthom, says this clearly anyways: You mean, that in in a network the classification should be performed on a subgroup of authors. Thus, people of this subgroup could be classified as core even though they would not be core, when classification is performed on the whole group of authors in the network. So the parameter is no simple restriction of the classification results to the specified authors, but instead, it declares on which authors the classification should be performed. Is that correct?

And as response to your question:

Question: Do we also need this parameter for non-network-based classification functions?

For consistency, I suggest to also add this parameter to all other classification functions (like the commit count classification, etc.). Or is there any reason, why the restriction only makes sense for networks?

You also mentioned that we should keep outliers when we are splitting networks:

The user has to keep outliers to have them in the network to analyze (for example, when splitting networks, remove.isolates has to be FALSE. I have to think about that whether this does have any other side effects or not...)

I absolutely agree. Keeping the authors in the network, even if they were inactive in one period, is a good idea. I also suggest to change the default value of remove.isolates from TRUE to FALSE for the following functions, provided that these changes do not cause any serious trouble elsewhere (which I have not checked yet):

split.network.time.based
split.networks.time.based
split.network.activity.based
split.network.time.based.by.ranges
split.network.by.bins

One more question: Should the restrict.classification.to.vertices be a vector of strings or a vector of vertex ids? Or should both be allowed? And what should happen, when we choose a vertex which does not exist (because there is no vertex with the specified id or name)? Should this be simply ignored or should it throw an error?

bockthom commented 5 years ago

Thanks for your comments on that, @SCPhantom. Let me reply to each of them individually now:

Sooo, coming now to the second main point:

Add a parameter restrict.classification.to.vertices to all network-based classification functions to specify the authors for which the class should be determined. For example, when passing a list of active developers (based on commit data) to this parameter, then only those developers should be classified as core resp. peripheral. All the other developers are still part of the network but are omitted from the classification. So, in the end, it is the user's decision, which developers to classify and which not. Hence, the user has to care about which cases matter in the network of interest.

Just to make sure, but I think your statement, @bockthom, says this clearly anyways: You mean, that in in a network the classification should be performed on a subgroup of authors. Thus, people of this subgroup could be classified as core even though they would not be core, when classification is performed on the whole group of authors in the network. So the parameter is no simple restriction of the classification results to the specified authors, but instead, it declares on which authors the classification should be performed. Is that correct?

Yes, that is correct (at least at my point of view -- what do you think @clhunsen ?).

Let's have a short example: The network may consist of the four nodes A B C D, but A is only part of the network as it was active in the time range before and we now have a new edge B -> A. So what I would like to have is a classification of all the active developers, which might be B C D in this case. So, the user who calls the classification function has to specify, which developers should be classfied. In our case, the user should pass B C D as parameter to the classification function. When we determine the classification value for each node (e.g., degree centrality), we get something like this:

A 1
B 2
C 2
D 1

What I would like to do here is just omitting A and then perform the (usual 20/80) classification of the rest:

B 2
C 2
D 1

which would classify B and C as core and D as peripheral.

Any other ideas?

And as response to your question:

Question: Do we also need this parameter for non-network-based classification functions?

For consistency, I suggest to also add this parameter to all other classification functions (like the commit count classification, etc.). Or is there any reason, why the restriction only makes sense for networks?

Good question. The reason why we need the restriction for networks is the way we have implemented network-based splitting: When we split networks (not data), we just look at the dates at the edges. If, for instance, developer X today changes a file that was changed by developer Y 10 years before, we get an edge X -> Y with today's time stamp (if we use respect.temporal.order = TRUE, for example). So, when we split the network into 3-month-range networks, we will have both X and Y in the last network as the edge X -> Y has today's time stamp as date. Hence, Y is part of the last network even if it was inactive for the last 10 years. That's why we have this issue and why we want to restrict the list of nodes. For data-based classification, we (currently) only rely on time stamps of certain activities which do not automatically involve other (in-active) authors.

So I asked my previous question to find out if there would be also use-cases where someone needs to restricts also the authors on data-based classification (e.g., commit-count classification).

So, consistency is always a good point, but a real use-case would also be nice ;)

You also mentioned that we should keep outliers when we are splitting networks:

The user has to keep outliers to have them in the network to analyze (for example, when splitting networks, remove.isolates has to be FALSE. I have to think about that whether this does have any other side effects or not...)

I absolutely agree. Keeping the authors in the network, even if they were inactive in one period, is a good idea. I also suggest to change the default value of remove.isolates from TRUE to FALSE for the following functions, provided that these changes do not cause any serious trouble elsewhere (which I have not checked yet):

Hm. On the one hand, I like the idea (because then the end user would not have to deal with this any more as all nodes are there even if they are isolates so we don't loose information by default). On the other hand, changing the default value of remove.isolates could potentially cause lots of trouble elsewhere... There are lots of scripts which use this function and I am not sure which assumptions where made before regarding the removal of isolates. Also several test cases may be affected. @clhunsen What's your opinion on that?

* `split.network.time.based`

* `split.networks.time.based`

* `split.network.activity.based`

* `split.network.time.based.by.ranges`

* `split.network.by.bins`
One more question: Should the restrict.classification.to.vertices be a vector of strings or a vector of vertex ids? Or should both be allowed?

As far as I know, we always use names. So, it is enough (for me, at least) to allow only a vector of strings.

And what should happen, when we choose a vertex which does not exist (because there is no vertex with the specified id or name)? Should this be simply ignored or should it throw an error?

My suggestion: Neither. I would just print a warning that a specific author is not part of the network, but simply perform the classification on the remaining existing nodes.

jkronaw commented 5 years ago

Good question. The reason why we need the restriction for networks is the way we have implemented network-based splitting: When we split networks (not data), we just look at the dates at the edges. If, for instance, developer X today changes a file that was changed by developer Y 10 years before, we get an edge X -> Y with today's time stamp (if we use respect.temporal.order = TRUE, for example). So, when we split the network into 3-month-range networks, we will have both X and Y in the last network as the edge X -> Y has today's time stamp as date. Hence, Y is part of the last network even if it was inactive for the last 10 years. That's why we have this issue and why we want to restrict the list of nodes. For data-based classification, we (currently) only rely on time stamps of certain activities which do not automatically involve other (in-active) authors.

Thanks for that explanation, that really makes it clear!

So I asked my previous question to find out if there would be also use-cases where someone needs to restricts also the authors on data-based classification (e.g., commit-count classification).

So, consistency is always a good point, but a real use-case would also be nice ;)

One use-case, which I could imagine, would be the following:

I split a network into 3-month windows
Only the last network of the splitted networks is considered further
I restrict authors to a certain subgroup and classify on this subgroup
Then I want to compare the classification results with the results from the commit count classification
To do that, I also split the commits into 3-month windows
Only the last 3-month window of commits is considered further
And then I would need to perform a classification on the same subgroup, don't I? This applies especially when the subgroup, that I chose to restrict to, is not exactly the subgroup consisting of all authors that authored a commit in this last 3-month window, but when the subgroup is indeed smaller.

I hope I could explain that good enough.

clhunsen commented 5 years ago

Just to make sure, but I think your statement, @bockthom, says this clearly anyways: You mean, that in in a network the classification should be performed on a subgroup of authors. Thus, people of this subgroup could be classified as core even though they would not be core, when classification is performed on the whole group of authors in the network. So the parameter is no simple restriction of the classification results to the specified authors, but instead, it declares on which authors the classification should be performed. Is that correct?

Yes, that is correct (at least at my point of view -- what do you think @clhunsen ?).

I can only agree.

You also mentioned that we should keep outliers when we are splitting networks:

The user has to keep outliers to have them in the network to analyze (for example, when splitting networks, remove.isolates has to be FALSE. I have to think about that whether this does have any other side effects or not...)

I absolutely agree. Keeping the authors in the network, even if they were inactive in one period, is a good idea. I also suggest to change the default value of remove.isolates from TRUE to FALSE for the following functions, provided that these changes do not cause any serious trouble elsewhere (which I have not checked yet):

Hm. On the one hand, I like the idea (because then the end user would not have to deal with this any more as all nodes are there even if they are isolates so we don't loose information by default). On the other hand, changing the default value of remove.isolates could potentially cause lots of trouble elsewhere... There are lots of scripts which use this function and I am not sure which assumptions where made before regarding the removal of isolates. Also several test cases may be affected. @clhunsen What's your opinion on that?

I am strongly against a change of the default value. There are two main reasons for me:

The default use case has always been to split networks and data as consistently as possible. This means: When we remove commit information, for example, such that an author does not appear in the resulting data set anymore, then we would not want to include this inactive author in the split network either.

I know that we have such range-spanning edges that keep some inactive authors in the network, but those are the exception: They are absolutely indistinguishable from those that are just the first ones in e-mail threads, for example (when respecting temporal order). Thus, we cannot do anything about them without having a look at the data in parallel.
When introducing these parameters remove.isolates (https://github.com/se-passau/codeface-extraction-r/commit/d4b8ff9548f09db45e535e4634405e0a75211761 and issue #99), we kept the previous default already, i.e., setting the parameters to TRUE by default.

So I asked my previous question to find out if there would be also use-cases where someone needs to restricts also the authors on data-based classification (e.g., commit-count classification). So, consistency is always a good point, but a real use-case would also be nice ;)

One use-case, which I could imagine, would be the following:

I split a network into 3-month windows

Only the last network of the splitted networks is considered further

I restrict authors to a certain subgroup and classify on this subgroup

Then I want to compare the classification results with the results from the commit count classification

To do that, I also split the commits into 3-month windows

Only the last 3-month window of commits is considered further

And then I would need to perform a classification on the same subgroup, don't I? This applies especially when the subgroup, that I chose to restrict to, is not exactly the subgroup consisting of all authors that authored a commit in this last 3-month window, but when the subgroup is indeed smaller.

I hope I could explain that good enough.

Sounds reasonable. To rephrase it: To analyze a subgroup of developers only, a restriction is needed. And we need this restriction being possible for all classification types to stay consistent.

Should the restrict.classification.to.vertices be a vector of strings or a vector of vertex ids? Or should both be allowed?

As far as I know, we always use names. So, it is enough (for me, at least) to allow only a vector of strings.

Yes, we always used names everywhere. For vertices, it should also be possible to pass vertex sequences (analogously to the function igraph::delete.vertices which allows both, too) without advertising this too broadly.

And what should happen, when we choose a vertex which does not exist (because there is no vertex with the specified id or name)? Should this be simply ignored or should it throw an error?

My suggestion: Neither. I would just print a warning that a specific author is not part of the network, but simply perform the classification on the remaining existing nodes.

I agree with @bockthom on this, a warning is sufficient. At this moment, I think about constructing the subgroup of developers only once and then passing the list to the classification functions along with a set of networks. Then, we may run into this "problem" very likely.

clhunsen commented 5 years ago

And what should happen, when we choose a vertex which does not exist (because there is no vertex with the specified id or name)? Should this be simply ignored or should it throw an error?

My suggestion: Neither. I would just print a warning that a specific author is not part of the network, but simply perform the classification on the remaining existing nodes.

I agree with @bockthom on this, a warning is sufficient. At this moment, I think about constructing the subgroup of developers only once and then passing the list to the classification functions along with a set of networks. Then, we may run into this "problem" very likely.

I talked to @bockthom just a moment ago and we realized that an active developer who is not present in the network needs to be added to the classification (although with a value of NA)!

jkronaw commented 5 years ago

I talked to @bockthom just a moment ago and we realized that an active developer who is not present in the network needs to be added to the classification (although with a value of NA)!

What do you mean by an active developer which is not present in the network? Isn't he an in-active developer then? Or dou you mean a developer which was specified in the restrict.classification.to.vertices parameter?

hechtlC commented 5 years ago

@SCPhantom I guess what @bockthom and @clhunsen mean are developers that for example do write mails but have no commits in the repository or vice versa. Then they should be considered in the classification although they do not appear in the respective network. Am I right @bockthom and @clhunsen?

clhunsen commented 5 years ago

Or dou you mean a developer which was specified in the restrict.classification.to.vertices parameter?

Yes. The idea is that we need to add all developer that are not part of the data/network used for the classification needs to be added to the classification result if and when they are specified in the new parameter restrict.classification.to.vertices. Their default value is NA.

@SCPhantom I guess what @bockthom and @clhunsen mean are developers that for example do write mails but have no commits in the repository or vice versa. Then they should be considered in the classification although they do not appear in the respective network. Am I right @bockthom and @clhunsen?

Yes, this is a good scenario why we can have such a situation where restrict.classification.to.vertices may contain vertices(developers that are not present in the data/network that is to be used for classification.

Sorry for my confusing words. I was still captured in the discussion with @bockthom when writing this. :wink:

jkronaw commented 5 years ago

I have added to discussed functionality so far.

Could you please take a look at the changes? As soon as I know, that it is ok for the most part, I'm going to add the documentation and comments, as well as adjusting the changelog and the readme.

bockthom commented 5 years ago

Thanks for your changes. I will have a closer look at them tomorrow.

One thing I have recognized already: We now have 2019, not any more 2018 📅 .

https://github.com/SCPhantom/codeface-extraction-r/commit/9ddb0c04e4c3b537a1c34c541fa1606276e7f37a#diff-97d59f43cc448c0006be81b48722a7b9R21

I will get back to you tomorrow.

jkronaw commented 5 years ago

One thing I have recognized already: We now have 2019, not any more 2018

Oh lol, I guess that won't be the last time happening 😂

bockthom commented 5 years ago

In general, your changes look very good @SCPhantom. I also talked to @clhunsen and we found some little issues that need to be fixed.

The name of the function append.inactive.authors and the wording within this function is slightly to specific. It could be the case that those restrictions are not made with respect to active or inactive authors. Suggestion: Something like consolidate.authors or similar (however, there is still room for improvement). In your comments and variable names, please also don't use "active" or "inactive", but rather describe that we consider the authors which are not part of the network but part of the given restriction list.

Maybe we could also make the restriction parameter explicit to the top level classification functions to ease documenting it.

There are some additional issues which are related to issue #70 -- that's why I will comment on those there.

jkronaw commented 5 years ago

The name of the function append.inactive.authors and the wording within this function is slightly to specific. It could be the case that those restrictions are not made with respect to active or inactive authors. Suggestion: Something like consolidate.authors or similar (however, there is still room for improvement). In your comments and variable names, please also don't use "active" or "inactive", but rather describe that we consider the authors which are not part of the network but part of the given restriction list.

I took care of this in a different way. I think that the new solution is the one that uses the minimum of redundant code without changing the interfaces. 😉

This also relates to issue #70.

Maybe you can take a look to give me the first feedback. I have not yet tested it all out in detail but it should work! Documentation is coming later then.

clhunsen commented 5 years ago

I have had a look into your commits a moment ago. In general, your changes look good so far.

I added two individual comments directly on the commits: https://github.com/SCPhantom/codeface-extraction-r/commit/c7f07c2b42c31b9be9069d81444792c031fc6525#r32183363 and https://github.com/SCPhantom/codeface-extraction-r/commit/5569995a51768076ae812e9cd2b449f3b73ba6aa#r32183262.

bockthom commented 5 years ago

Merged in #156.