se-sic / coronet

coronet – the R library for configurable and reproducible construction of developer networks
GNU General Public License v2.0

`construct.ranges` for cumulative ranges #265

Open bockthom opened 4 months ago

bockthom commented 4 months ago

Description

In coronet, we have a function construct.ranges that takes a list of revisions and creates range names out of it, as in the following example:

> bins = c("2020-01-01", "2020-04-01", "2020-07-01", "2020-10-01", "2020-12-31")
> construct.ranges(bins, sliding.window = FALSE)
[1] "2020-01-01-2020-04-01" "2020-04-01-2020-07-01" "2020-07-01-2020-10-01" "2020-10-01-2020-12-31"

This function is able to construct sliding-window ranges, but not to construct cumulative ranges.

We have a dedicated function construct.cumulative.ranges, but this function has a completely different interface (it takes a start date, an end date, and a time period), similar to construct.consecutive.ranges and construct.overlapping.ranges. However, the function construct.ranges itself (which takes just a vector of dates) is not capable of constructing cumulative ranges.

Therefore, I suggest enhancing the function construct.ranges with an additional parameter to construct cumulative ranges. If adding a new parameter introduces more problems than benefits, an additional function might also be helpful, but then we have the problem of naming conflicts with the existing functions. So, I'd be glad if we found a suitable way to enhance the existing function construct.ranges.

Desired output for construct.ranges with cumulative ranges:

[1] "2020-01-01-2020-04-01" "2020-01-01-2020-07-01" "2020-01-01-2020-10-01" "2020-01-01-2020-12-31"
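To make the desired behavior concrete, here is a minimal sketch in plain R. It is not coronet's implementation: the function name sketch.construct.ranges and the parameter name cumulative are hypothetical and chosen only for illustration, and sliding windows are deliberately left out.

```r
## Hypothetical sketch, not coronet's construct.ranges: builds range names from
## a vector of bin boundaries. The parameter name 'cumulative' is an assumption.
sketch.construct.ranges = function(bins, cumulative = FALSE) {
    bins = as.character(bins)

    ## at least two boundaries are needed to form a single range
    if (length(bins) < 2) {
        return(character(0))
    }

    ## every range ends at one of the boundaries after the first one
    ends = bins[-1]

    if (cumulative) {
        ## cumulative: every range starts at the very first boundary
        starts = rep(bins[1], length(ends))
    } else {
        ## consecutive: every range starts at the preceding boundary
        starts = bins[-length(bins)]
    }

    return(paste(starts, ends, sep = "-"))
}

bins = c("2020-01-01", "2020-04-01", "2020-07-01", "2020-10-01", "2020-12-31")
print(sketch.construct.ranges(bins, cumulative = TRUE))
```

With cumulative = FALSE, the sketch yields the same range names as the non-sliding example at the top of this issue; with cumulative = TRUE, it yields the desired output listed above.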

Motivation

Constructing ranges in a cumulative way is particularly useful when analyzing commit-interaction data, but also in many other use cases. In general, enhancing the currently existing function would provide an easy way to construct range-data objects cumulatively by simply passing a list of fixed bins to the range-construction function, and passing the resulting ranges to split.data.time.based.by.ranges afterwards.

maxloeffler commented 2 months ago

I have done a prototype implementation here and some tests here. In particular, let me know whether sliding windows and cumulative ranges are mutually exclusive; I would need to slightly update my implementation in that case.

Edit: Also let me know if this addition fits for you with my currently open wish-wash PR or if we should wait for a new one.

bockthom commented 2 months ago

I have done a prototype implementation

The implementation looks good to me (except for two typos/inconsistencies).

and some tests

The structure of the tests looks good, but I did not have time yet to find out whether the behavior in the tests is correct or not.

In particular, let me know whether sliding windows and cumulative ranges are mutually exclusive

I've seen that you already have tests for the combination of cumulative ranges and sliding windows, but just from looking at the tests I cannot judge whether such a combination is useful or not. Could you please post a small example directly showing what the ranges look like in such a case?

Also let me know if this addition fits for you with my currently open wish-wash PR or if we should wait for a new one.

If the implementation stays as small as it is currently, I'd go for adding it to your "open wish-wash PR". But let's discuss this tomorrow.

maxloeffler commented 2 months ago

Regarding sliding-window ranges and our recent discussion: sliding.windows in construct.ranges differs in some way from what we understand by sliding.windows in splitting.

For the cumulative ranges, that means the following (example):

We want to split data into the following bins: 2016-01-01 - 2017-01-01, 2017-01-01 - 2018-01-01, and 2018-01-01 - 2019-01-01. We also specify sliding.windows = TRUE and therefore, in the end, receive network(-split)s that have the following bounds: 2016-01-01 - 2017-01-01, 2016-07-01 - 2017-07-01, 2017-01-01 - 2018-01-01, 2017-07-01 - 2018-07-01, and 2018-01-01 - 2019-01-01 (which is also exactly the output of construct.ranges(..., sliding.window = TRUE)). When we construct ranges and specify to construct cumulative ranges, all resulting ranges start at the start of the earliest range, i.e., the resulting ranges would be 2016-01-01 - 2017-01-01, 2016-01-01 - 2017-07-01, 2016-01-01 - 2018-01-01, 2016-01-01 - 2018-07-01, and 2016-01-01 - 2019-01-01.
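The example above can be reproduced with a small sketch in plain R. This is not the prototype implementation: the function name sketch.sliding.ranges and the parameter cumulative are hypothetical, and the input vector of half-year boundaries is an assumption about what is passed in place of the "..." above. The sketch pairs each boundary with the one two positions later, which gives the sliding-window bounds, and then optionally pins every start to the first boundary.

```r
## Hypothetical sketch, not coronet code: sliding-window ranges over a vector of
## boundaries, optionally made cumulative. All names here are assumptions.
sketch.sliding.ranges = function(bins, cumulative = FALSE) {
    bins = as.character(bins)
    n = length(bins)

    ## a sliding window spans two steps, so three boundaries are the minimum
    if (n < 3) {
        return(character(0))
    }

    ## window i runs from boundary i to boundary i + 2
    starts = bins[1:(n - 2)]
    ends = bins[3:n]

    if (cumulative) {
        ## cumulative: keep the window ends, but start every range at the first boundary
        starts = rep(bins[1], length(ends))
    }

    return(paste(starts, ends, sep = " - "))
}

## assumed input: yearly bins plus the half-step boundaries in between
half.steps = c("2016-01-01", "2016-07-01", "2017-01-01", "2017-07-01",
               "2018-01-01", "2018-07-01", "2019-01-01")

print(sketch.sliding.ranges(half.steps))
print(sketch.sliding.ranges(half.steps, cumulative = TRUE))
```

Under these assumptions, the first call gives the five sliding-window bounds listed above and the second gives the five cumulative ranges that all start at 2016-01-01.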

Taking everything into account, I think cumulative sliding-window ranges may be as useful as cumulative regular ranges, depending on the use case, but I'm not entirely sure ^^

bockthom commented 2 months ago

Ok, let's keep the case of constructing cumulative ranges for sliding-window ranges. (I don't think that this will actually be used; but, in general, the resulting ranges look reasonable.)