psolymos / intrval

Relational Operators for Intervals
41 stars 4 forks source link

better operators for interval-to-interval relations #6

Closed mgacc0 closed 7 years ago

mgacc0 commented 7 years ago

It would be really great to implement some new operators for interval-to-interval overlapping.

Instead of having the %[o]% operator for checking the overlapping between the closed intervals [a, b] and [x, y] expressed as c(a, b) %[o]% c(x, y), it would be great having the %[]o[]% operator expressed as c(a, b) %[]o[]% c(x, y).

And then, for example:

c(1, 2) %[]o[]% c(2, 3)
# TRUE

c(2, 3) %[]o[]% c(1, 2)
# TRUE

c(1, 2) %()o()% c(2, 3)
# FALSE

c(2, 3) %()o()% c(1, 2)
# FALSE

Consequently, that would imply creating 32 operators (for all the combinations of open/closed endpoints):

interval-to-interval operator
%[]o[]%
%[]o[)%
%[]o(]%
%[]o()%
%[)o[]%
%[)o[)%
%[)o(]%
%[)o()%
... (continues)
psolymos commented 7 years ago

It crossed my mind to implement these, however, here is why I did not in the 1st go round:

  1. I found the %[]o[]% notation too long (almost looks like a TIE fighter), but it is kind of unavoidable.
  2. It is a lot of functions (4x4=16) that needs a lot of unit-tests.
  3. I did not see an immediate use case.

Out of these, point 3 is the most important, the other 2 are just stylistic/technical details. Do you have any suggestions on point 3?

The symmetric cases could be shortened, e.g. %[]o[]%=%[o]%, %()o()%=%(o)%.

mgacc0 commented 7 years ago
  1. The notation would be a bit long but, anyway, is intuitive (and quite equivalent to the mathematical notation). So that would make it easy to adopt.

  2. They are 16 functions. (I don't know why I wrote 32 previously...)

  3. In administrative data processing we use a lot of datetime ranges. And it's very usual to write them as: "from 01.12.2016 to 08.12.2016", meaning the semiopen interval ["2016-12-01 00:00:00", "2016-12-08 00:00:00").

    It would be a bit tedious:

    • having to write ["2016-12-01 00:00:00", "2016-12-07 23:59:59"]
    • or having to make two comparisons: one for overlapping with closed ranges and the next to check for the coincidence of endpoints
mgacc0 commented 7 years ago

I propose something like this:

.intrval3 <-
  function(interval1, type1, interval2, type2)
  {
    ab <- .get_intrval(interval1)
    cd <- .get_intrval(interval2)

    type1 <- match.arg(type1, c("[]", "[)", "(]", "()"))
    type2 <- match.arg(type2, c("[]", "[)", "(]", "()"))

    if (ab$a > cd$a) {
      tmp <- ab
      ab <- cd
      cd <- tmp

      tmp <- type1
      type1 <- type2
      type2 <- tmp
    }

    ! .lssthan(ab$b, cd,
               ifelse(substr(type1, 2L, 2L)=="]" & substr(type2, 1L, 1L)=="[",
                      "[", "("))
  }

.intrval3(c(1, 2), "()", c(2, 3), "()")
#  [1] FALSE

.intrval3(c(1, 2), "[]", c(2, 3), "[)")
#  [1] TRUE
psolymos commented 7 years ago

OK, I will make a branch and start development with your proposal.

psolymos commented 7 years ago

@mgacc0 see here what I have so far in overlap-operators branch: https://github.com/psolymos/intrval/blob/overlap-operators/extras/playground.R

psolymos commented 7 years ago

@mgacc0 I could test your function and it passed all the comparisons I came up with. The only issue right now is that it works for simple case but not yet vectorized (i.e. cannot accept matrices, lists because of the if (iv1$a > iv2$a) {...} statement).

mgacc0 commented 7 years ago

Firstly, a necessary correction: checking for empty intervals.

is_empty_interval <- function(interval, type) {
  interval[[1]]==interval[[2]] & type=="()"
}

overlaps <- function (interval1, interval2, type1, type2)
{
    stopifnot(is.vector(interval1) && length(interval1)<=2)
    stopifnot(is.vector(interval2) && length(interval2)<=2)

    iv1 <- .get_intrval(interval1)
    iv2 <- .get_intrval(interval2)

    type1 <- match.arg(type1, c("[]", "[)", "(]", "()"))
    type2 <- match.arg(type2, c("[]", "[)", "(]", "()"))

    if (is_empty_interval(iv1, type1) | is_empty_interval(iv2, type2)) {
      return(FALSE)
    }

    if (iv1$a > iv2$a) {
        tmp <- iv1
        iv1 <- iv2
        iv2 <- tmp

        tmp <- type1
        type1 <- type2
        type2 <- tmp
    }

    !.lssthan(iv1$b, iv2,
        ifelse(substr(type1, 2L, 2L)=="]" && substr(type2, 1L, 1L)=="[",
        "[", "("))
}

I would rename the .intrval3 function to overlaps and make it visible.


For easier testing, I would recommend the testthat package:

library(testthat)
expect_false(overlaps(c(2, 4), c(2, 2), "[]", "()"))
expect_false(overlaps(c(2, 2), c(2, 4), "()", "[]"))

For calling over dataframes, you could do something like this:

library(dplyr, purrr)

mm <- expand.grid(
  iv1_a=c(1, 2, 3, 4, 5),
  iv1_b=c(1, 2, 3, 4, 5),
  iv2_a=c(1, 2, 3, 4, 5),
  iv2_b=c(1, 2, 3, 4, 5),
  type1=c("[]", "[)", "(]", "()"),
  type2=c("[]", "[)", "(]", "()"),
  stringsAsFactors=FALSE)
nrow(mm)
# [1] 10000

mm %>%
  sample_n(4)
#      iv1_a iv1_b iv2_a iv2_b type1 type2
# 694      4     4     3     1    [)    []
# 612      2     3     5     5    []    []
# 1320     5     4     3     1    (]    []
# 6273     3     5     1     1    (]    (]

mm %>%
  sample_n(10) %>%
  by_row(~ overlaps(c(.[[1]], .[[2]]), c(.[[3]], .[[4]]),
                    .[[5]], .[[6]]), .collate="cols")
# # A tibble: 10 × 7
#    iv1_a iv1_b iv2_a iv2_b type1 type2  .out
#    <dbl> <dbl> <dbl> <dbl> <chr> <chr> <lgl>
# 1      5     2     2     2    (]    []  TRUE
# 2      1     4     4     5    ()    () FALSE
# 3      3     5     2     5    ()    []  TRUE
# 4      4     2     4     3    (]    (]  TRUE
# 5      5     3     2     1    (]    (] FALSE
# 6      2     3     2     4    (]    []  TRUE
# 7      2     1     4     2    ()    [) FALSE
# 8      5     2     1     4    (]    (]  TRUE
# 9      4     1     2     5    [)    []  TRUE
# 10     4     2     3     5    (]    [)  TRUE

mm %>%
  .[, 1:4] %>%
  sample_n(10) %>%
  by_row(~ c(.[[1]], .[[2]]) %[]o[]% c(.[[3]], .[[4]]), .collate="cols")
# # A tibble: 10 × 5
#    iv1_a iv1_b iv2_a iv2_b  .out
#    <dbl> <dbl> <dbl> <dbl> <lgl>
# 1      5     3     3     4  TRUE
# 2      3     1     2     2  TRUE
# 3      3     2     5     4 FALSE
# 4      3     2     2     4  TRUE
# 5      2     3     4     2  TRUE
# 6      3     4     2     5  TRUE
# 7      3     3     3     3  TRUE
# 8      3     3     4     1  TRUE
# 9      1     5     2     3  TRUE
# 10     2     5     2     4  TRUE

The notation for single vectors (intervals) is intuitive. But I think that the intrval library doesn't need to provide an operator for data.frames. You could place an example (on the README) of using the function purrr::by_row and overlaps.


Anyway, if you want to know how a function for data.frames would be:

intrval4 <- function (df1)
{
  stopifnot(is.data.frame(df1))
  stopifnot(nrow(df1)==1)

  iv1 <- .get_intrval(c(df1$iv1_a, df1$iv1_b))
  iv2 <- .get_intrval(c(df1$iv2_a, df1$iv2_b))

  type1 <- match.arg(df1$type1, c("[]", "[)", "(]", "()"))
  type2 <- match.arg(df1$type2, c("[]", "[)", "(]", "()"))

  if (is_empty_interval(iv1, type1) | is_empty_interval(iv1, type1)) {
    return (FALSE)
  }

  if (iv1$a > iv2$a) {
    tmp <- iv1
    iv1 <- iv2
    iv2 <- tmp

    tmp <- type1
    type1 <- type2
    type2 <- tmp
  }

  !.lssthan(iv1$b, iv2,
            ifelse(substr(type1, 2L, 2L)=="]" && substr(type2, 1L, 1L)=="[",
                   "[", "("))
}

mm %>%
  sample_n(20) %>%
  by_row(intrval4, .collate=c("cols"))
psolymos commented 7 years ago

Thanks. I just wanted to make the operators consistent with the %[o]% cases that can accept different formats on both ends and recycle values if needed. I agree it is not absolutely necessary for all use cases, but still nice to have.

I like your testing approach, but want to avoid too much dependencies. Even if I ignore /tests for CRAN versions, CI will complain about missing Suggests declarations.

psolymos commented 7 years ago

This function seems to do the job:

.intrval3 <-
function(interval1, interval2, type1, type2)
{
    iv1 <- .get_intrval(interval1)
    iv2 <- .get_intrval(interval2)

    type1 <- match.arg(type1, c("[]", "[)", "(]", "()"))
    type2 <- match.arg(type2, c("[]", "[)", "(]", "()"))

    b1 <- ifelse(iv1$a < iv2$a, iv1$b, iv2$b)
    a2 <- ifelse(iv1$a < iv2$a, iv2$a, iv1$a)
    type1v <- ifelse(iv1$a < iv2$a, substr(type1, 2L, 2L), substr(type2, 2L, 2L))
    type2v <- ifelse(iv1$a < iv2$a, substr(type2, 1L, 1L), substr(type1, 1L, 1L))

    ifelse(type1v == "]" & type2v == "[",
        b1 >= a2,
        b1 > a2)
}

The idea is similar to your proposal:

  1. sort intervals based on lower endpoints (a1 < a2)
  2. compare b1 to a2 based on open/closed endpoints.
mgacc0 commented 7 years ago

Could you have a look at ../pull/13 ?

psolymos commented 7 years ago

@mgacc0 I merged the overlap-operators branch, see comments on #15

Next steps include more testing and improving code coverage (it dropped because right now only internals are tested).