tidyverse / funs

Collection of low-level functions for working with vctrs

Implement ilag() and ilead() #46

Open DavisVaughan opened 4 years ago


Related to #34

These are variations on lead() and lag() that require an order_by argument, but also respect the "spacing" between order_by observations.

This is very useful for time series, and is a neat feature in Stata. See slides 10-13 https://www.princeton.edu/~otorres/TS101.pdf
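As a rough illustration of what "respecting spacing" means, here is a minimal base-R sketch (not the proposed implementation). It contrasts a purely positional lag with a lookup on the shifted index; the variable names are only for illustration.

```r
x <- c(5, 6, 7, 8)
i <- as.Date("2019-01-01") + c(0, 1, 3, 4)

# Positional lag: always takes the previous row, whatever its date is
x_lag <- c(NA, head(x, -1))

# Index-aware lag: for each date, find the row dated exactly one day earlier.
# Dates with no such row (here 2019-01-01 and 2019-01-04) get NA.
loc <- match(i - 1, i)
x_ilag <- x[loc]
```

The positional version pairs 2019-01-04 with the value from 2019-01-02, while the index-aware version leaves it missing because 2019-01-03 is absent.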

Also think about

Implementation:

library(vctrs)
library(rlang)

ilag_ilead_impl <- function(x, order_by, n, default, fn) {
  vec_assert(x)
  vec_assert(order_by)

  vec_assert(n, size = 1L)
  n <- vec_cast(n, integer(), x_arg = "n")

  x_size <- vec_size(x)
  order_by_size <- vec_size(order_by)

  if (x_size != order_by_size) {
    abort("`x` and `order_by` must have the same size.")
  }

  # Note: a dedicated vec_any_na() would be nicer here (see vctrs#544)
  if (any(vec_equal_na(order_by))) {
    abort("`order_by` cannot have `NA` values.")
  }

  if (x_size == 0L) {
    return(x)
  }

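  # Shift the index itself (`-` for lag, `+` for lead), then look up where
  # each shifted value occurs in the original index; gaps yield `NA`
  order_by_shift <- fn(order_by, n)

  loc <- vec_match(order_by_shift, order_by)

  # Slicing with an `NA` location produces a missing value in the output
  out <- vec_slice(x, loc)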

  if (!is.null(default)) {
    na_loc <- vec_equal_na(loc)
    default <- vec_cast(default, x, x_arg = "default", to_arg = "x")

    vec_slice(out, na_loc) <- default
  }

  out
}

ilag <- function(x, order_by, n = 1L, default = NULL) {
  ilag_ilead_impl(x, order_by, n, default, `-`)
}

ilead <- function(x, order_by, n = 1L, default = NULL) {
  ilag_ilead_impl(x, order_by, n, default, `+`)
}
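To see how `n`, the shift direction, and `default` interact without installing vctrs, the same matching logic can be approximated in base R. This is only a sketch: `shift_by_index()` is a hypothetical stand-in, it skips all of the input checks above, and it uses `default = NA` where the real functions use `default = NULL`.

```r
# Hypothetical base-R stand-in for ilag()/ilead(); illustrative only
shift_by_index <- function(x, order_by, n = 1L, default = NA, fn = `-`) {
  # Locate each shifted index value in the original index (NA if absent)
  loc <- match(fn(order_by, n), order_by)
  out <- x[loc]
  # Fill only the positions that had no matching index value
  out[is.na(loc)] <- default
  out
}

x <- c(5, 6, 7, 8)
i <- as.Date("2019-01-01") + c(0, 1, 3, 4)

# Lead by one day: 2019-01-03 and 2019-01-06 do not exist, so rows 2 and 4 are NA
x_ilead <- shift_by_index(x, i, fn = `+`)

# Lag by two days, filling gaps with 0: only 2019-01-04 has a match (2019-01-02)
x_ilag2 <- shift_by_index(x, i, n = 2L, default = 0)
```

Because the lookup is done on index values rather than row positions, the result does not depend on how the rows are sorted.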

Usage:

library(dplyr)

df <- tibble(
  x = c(5, 6, 7, 8),
  i = as.Date("2019-01-01") + c(0, 1, 3, 4)
)

# Notice how the temporal spacing is respected
# We get an `NA` at 2019-01-04 because 2019-01-03 doesn't exist
df %>%
  mutate(
    x_lag = lag(x),
    x_ilag = ilag(x, i)
  )
#> # A tibble: 4 x 4
#>       x i          x_lag x_ilag
#>   <dbl> <date>     <dbl>  <dbl>
#> 1     5 2019-01-01    NA     NA
#> 2     6 2019-01-02     5      5
#> 3     7 2019-01-04     6     NA
#> 4     8 2019-01-05     7      7

# - lag()'s default doesn't respect ordering of any variable
# - lag(order_by) respects ordering but not spacing
# - ilag(order_by) respects ordering and spacing
df_rev <- arrange(df, desc(i))

df_rev %>%
  mutate(
    x_lag = lag(x),
    x_lag_ob = lag(x, order_by = i),
    x_ilag = ilag(x, i)
  )
#> # A tibble: 4 x 5
#>       x i          x_lag x_lag_ob x_ilag
#>   <dbl> <date>     <dbl>    <dbl>  <dbl>
#> 1     8 2019-01-05    NA        7      7
#> 2     7 2019-01-04     8        6     NA
#> 3     6 2019-01-02     7        5      5
#> 4     5 2019-01-01     6       NA     NA

One thought was to give lag() a respect_spacing parameter rather than creating a new function. But I think it needs to be a new function, because ilag() requires its order_by to be integerish under the hood, which is not a restriction on lag(). Practically, with a respect_spacing parameter, a problem would show up with character order_by variables. It would be strange for respect_spacing to stop this from working:

lag(1:3, order_by = c("a", "b", "c"))
# [1] NA  1  2

lag(1:3, order_by = c("a", "b", "c"), respect_spacing = TRUE)
# Error in order_by - n : non-numeric argument to binary operator
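The failure comes from the shift step itself, which can be checked in base R without any of the functions above. The snippet below is a small sketch that captures the error message instead of stopping.

```r
order_by <- c("a", "b", "c")

# Subtracting from a character vector is an error in base R, so the shifted
# index needed for an index-aware lag cannot be computed for this `order_by`
msg <- tryCatch(
  order_by - 1L,
  error = function(e) conditionMessage(e)
)
msg
```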

CC @earowang for the original inspiration of the functions. I think you could keep keyed_lag(), which could call this internally. I was excited by your implementation, and thought that it could be useful outside of the tsibble / time series context as well.