r-lib / slider

Sliding Window Functions
https://slider.r-lib.org

Feature request: slide only over some indices? #164

Closed ryantibs closed 2 years ago

ryantibs commented 2 years ago

Hi @DavisVaughan, first let me offer my thanks for this package: it's really nice (and fast)! I've found it useful in an R package that we're building for epidemic signal processing in particular, where "sliding" is a kind of canonical template for a bunch of common computations.

I do have one feature request. It pertains to the slide_index() family (though the analogy could be built for the slide() family as well). It would be useful to have an extra argument available to specify only a subset of the indices .i over which we want to apply the given function/formula .f. The default value of this argument can correspond to sliding over all indices, as the current version does.

Note that sliding over only a particular subset of .i cannot be achieved simply by filtering .x and .i before the call to slide_index(), because the calculation over that subset might still require more rows (potentially the full set of rows) of .x.

As an example, think about .x as a data frame where the rows correspond to variables measured over time, and .i indexes the time values. Suppose we want to perform some expensive computation that depends on the most recent 120 days, but only perform this computation at 30 day intervals.

I would be very grateful if you could consider this. Thanks!

DavisVaughan commented 2 years ago

"... depends on the most recent 120 days, but only perform this computation at 30 day intervals"

So this implies that you would not be returning something that is the same length as .x, since you wouldn't be computing .f length(.x) times. Being size stable is something that is required by slide() and slide_index() and friends, so this doesn't fit there.

But I know this is useful sometimes, so there is a second family of functions called hop() and hop_index() that allows for something like this. With this you get to create a custom set of bounds to apply .f over.

So I'd imagine that something like this would be useful for you:

library(tibble)
library(slider)
library(clock)

x <- sample(100, 500, replace = TRUE)
i <- as.Date("2019-01-01") + seq(0, length(x) - 1)

head(x)
#> [1] 15 92 37 26 78 42
head(i)
#> [1] "2019-01-01" "2019-01-02" "2019-01-03" "2019-01-04" "2019-01-05"
#> [6] "2019-01-06"

size <- 120L

# - Compute [start, stop] pairs with a range of ~120 days where possible
# - These are where you will compute your function between
# - Each `stop` date is 30 days apart
# - Compute the `stops` first to ensure your most recent datapoint is always included,
#   which seems desirable
stops <- rev(date_seq(from = max(i), to = min(i), by = -30))
starts <- pmax(stops - size, min(i))

# `hop_index_vec()` is the key here. It always returns something the same
# size as `starts` and `stops`, rather than something the same size as `x`.
result <- tibble(
  start = starts,
  stop = stops,
  n_days = as.integer(stop - start),
  value = hop_index_vec(x, i, start, stop, mean, .ptype = double())
)

result 
#> # A tibble: 17 × 4
#>    start      stop       n_days value
#>    <date>     <date>      <int> <dbl>
#>  1 2019-01-01 2019-01-20     19  56  
#>  2 2019-01-01 2019-02-19     49  50.6
#>  3 2019-01-01 2019-03-21     79  49.0
#>  4 2019-01-01 2019-04-20    109  49.5
#>  5 2019-01-20 2019-05-20    120  50.3
#>  6 2019-02-19 2019-06-19    120  50.5
#>  7 2019-03-21 2019-07-19    120  50.6
#>  8 2019-04-20 2019-08-18    120  50.6
#>  9 2019-05-20 2019-09-17    120  50.0
#> 10 2019-06-19 2019-10-17    120  49.9
#> 11 2019-07-19 2019-11-16    120  46.8
#> 12 2019-08-18 2019-12-16    120  46.7
#> 13 2019-09-17 2020-01-15    120  46.1
#> 14 2019-10-17 2020-02-14    120  47.4
#> 15 2019-11-16 2020-03-15    120  51.6
#> 16 2019-12-16 2020-04-14    120  48.9
#> 17 2020-01-15 2020-05-14    120  49.7

You could then use n_days to filter down to only the windows that had the full 120 days of history available, if you wanted to.
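A minimal sketch of that filtering step, rebuilt so it runs on its own. It uses base R's seq() on dates as a stand-in for clock::date_seq() and a plain data frame instead of a tibble; the `full` name is just for illustration.

```r
library(slider)

x <- sample(100, 500, replace = TRUE)
i <- as.Date("2019-01-01") + seq(0, length(x) - 1)
size <- 120L

# Base R stand-in for the date_seq() call above: stops are 30 days apart,
# counted backwards from the most recent date
stops <- rev(seq(max(i), min(i), by = -30))
starts <- pmax(stops - size, min(i))

result <- data.frame(
  start = starts,
  stop = stops,
  n_days = as.integer(stops - starts)
)
result$value <- hop_index_vec(x, i, starts, stops, mean, .ptype = double())

# Keep only the windows that span the full `size` days
full <- result[result$n_days == size, ]
nrow(full)
```

With 500 daily observations, the first few windows get clamped at min(i), so `full` has fewer rows than `result`.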

ryantibs commented 2 years ago

Thanks for the pointers and the example. Very helpful. I do think I could use hop_index() to get the functionality I'm looking for in general. I'm sorry I didn't notice it in the first place!

Re size stable: I guess you could make the extension I'm asking for size stable by just returning some flag (like NA) for the rows at which you didn't do any computations. But it doesn't seem worth it, since hop_index() basically already provides the needed functionality.

A question, if you don't mind me following up: is it obvious how you would enforce the analogue of .complete = TRUE with hop_index()? In your example you suggest computing n_days = stop - start and checking whether it's the desired 120 or not, but that's not equivalent to a "complete" window, at least in my interpretation of what a complete window means. To me, a complete window has n time points in it. You could still have stop - start = 120 but the data could have gaps in it.

Am I interpreting your definition of complete wrong? (It's not clear to me from the documentation for slide_index().) Is your definition of a complete window simply that it contains the min and max index values, and not necessarily all n index values? If so, I'm aware that I could fill gaps with tsibble functionality; I'm just unsure of the definition of completeness.

ryantibs commented 2 years ago

So ... I can easily answer my own question, with the simple example at the bottom. It looks like complete is just based on having the right min and max in the local window (which explains the computed values for 2021-01-05, 2021-01-06, 2021-01-07 in the example).

I'm ready to close this issue since I can just use hop_index() for my desired functionality. Thanks again for the pointers and the nice package.

> library(slider)
> library(tibble)
> i <- seq(as.Date("2021-01-01"), as.Date("2021-01-10"), by = "1 day")
> a <- 1:length(i)
> tibble(i, slide_val = slide_index_dbl(a, i, ~ mean(.x), .before = 3, 
+                                       .complete = TRUE))
# A tibble: 10 × 2
   i          slide_val
   <date>         <dbl>
 1 2021-01-01      NA  
 2 2021-01-02      NA  
 3 2021-01-03      NA  
 4 2021-01-04       2.5
 5 2021-01-05       3.5
 6 2021-01-06       4.5
 7 2021-01-07       5.5
 8 2021-01-08       6.5
 9 2021-01-09       7.5
10 2021-01-10       8.5
> 
> i2 <- i[i != as.Date("2021-01-04")]
> a2 <- a[i != as.Date("2021-01-04")]
> tibble(i2, slide_val = slide_index_dbl(a2, i2, ~ mean(.x), .before = 3, 
+                                        .complete = TRUE))
# A tibble: 9 × 2
  i2         slide_val
  <date>         <dbl>
1 2021-01-01     NA   
2 2021-01-02     NA   
3 2021-01-03     NA   
4 2021-01-05      3.33
5 2021-01-06      4.67
6 2021-01-07      6   
7 2021-01-08      6.5 
8 2021-01-09      7.5 
9 2021-01-10      8.5 

DavisVaughan commented 2 years ago

"... size stable by just returning some flag (like NA) for the rows at which you didn't do any computations."

slide() has a .step argument which does exactly that. slide_index() doesn't. I'm almost 100% certain that I tried to add it, but it didn't make sense for some reason, and hop_index() solved the problem well enough that I didn't pursue it further.
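For reference, a minimal sketch of what .step looks like on slide_dbl(). The input vector, window size, and step are made up for illustration.

```r
library(slider)

x <- 1:10

# Window: the current element plus the 2 before it.
# With `.step = 3`, `.f` is only called at every third position; the
# skipped positions are left as NA, so the output keeps the length of `x`.
res <- slide_dbl(x, mean, .before = 2, .step = 3)

length(res) == length(x)
which(!is.na(res))  # positions where `.f` was actually evaluated
```

This is positional, which is why it doesn't carry over directly to a date index that may have gaps.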


About .complete. Yeah, .complete is supposed to work somewhat the same for both slide() and slide_index(). The idea is that you first compute the window bounds, and if it is possible to generate a full window, then that is considered a "complete" window.

So in your example above 2021-01-03 - 3 days = 2020-12-31, but the min date is 2021-01-01, so it isn't possible to generate a full window. But 2021-01-05 - 3 days = 2021-01-02 so it is possible to generate a full window.

Whether or not there are any gaps in the window is a separate problem not tackled by .complete.
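A rough sketch of the two checks side by side, reusing the gapped example from earlier in the thread. The bounds rule is written out by hand, and the "no gaps" rule is an assumption about what a gap-free window means for a daily index (it must hold .before + 1 observations). The names `complete`, `n_obs`, and `val_strict` are just for illustration.

```r
library(slider)

i <- seq(as.Date("2021-01-01"), as.Date("2021-01-10"), by = "1 day")
keep <- i != as.Date("2021-01-04")  # drop one day to create a gap
i2 <- i[keep]
a2 <- seq_along(i)[keep]
before <- 3L

val <- slide_index_dbl(a2, i2, mean, .before = before, .complete = TRUE)

# Bounds rule: a window is complete when its lower bound does not fall
# before the first index value. Gaps inside the window are ignored.
complete <- (i2 - before) >= min(i2)
identical(complete, !is.na(val))

# Stricter check (assumption: daily data, so a gap-free window holds
# `before + 1` observations). Count the observations in each window and
# blank out the windows that come up short.
n_obs <- slide_index_int(a2, i2, length, .before = before)
val_strict <- ifelse(n_obs == before + 1L, val, NA_real_)

data.frame(i2, n_obs, val, val_strict)
```

The same counting idea should work with hop_index_vec() by passing length as .f over the same starts and stops.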