r-lib / slider

Sliding Window Functions
https://slider.r-lib.org

Write a small vignette on `.complete` and `slide_period_*()` family #202

Open wallyxie opened 4 months ago

wallyxie commented 4 months ago

Hi @DavisVaughan,

Per this Stack Overflow question, I am experiencing an issue where slide_period_dfr() produces the same output, including partial-period calculations, regardless of whether .complete is set to TRUE or FALSE. It looks like at least one other user was able to replicate this.

The issue can be replicated as follows:

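library(dplyr)      # add_count(), summarise()
library(lubridate)  # ymd(), days()
library(slider)     # slide_period_dfr()

set.seed(1)

# slider requires `.i` to be in ascending order, so build the dates oldest first
dates <- ymd("2023-12-31") - days(199:0)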

colors <- c('red', 'blue')
sample_colors <- sample(colors, 200, replace = TRUE)
objects <- c('pen', 'marker', 'brush')
sample_objects <- sample(objects, 200, replace = TRUE)

test_df <- data.frame(dates, sample_colors, sample_objects)

period_count  <- function(dat) {
    dat |>
        add_count(sample_colors, sample_objects, name = "sub_total") |>
        summarise(
            earliest_day_of_period = min(dates),
            latest_day_of_period = max(dates),
            day_span = latest_day_of_period - earliest_day_of_period,
            min_object_n = min(sub_total)
        ) 
}

slider::slide_period_dfr(
    test_df,
    .i = test_df$dates,
    .period = "day",
    .f = period_count,
    .every = 60,
    .complete = TRUE,
    .origin = max(test_df$dates) +1
)

Running

test_df_period_counts <- slide_period_dfr(
  test_df,
  .i = test_df$dates,
  .period = "day",
  .f = period_count,
  .every = 60,
  .complete = TRUE,
  .origin = max(test_df$dates) +1
)

then produces

#   earliest_day_of_period latest_day_of_period day_span min_object_n
# 1             2023-08-15           2023-09-03  19 days            1
# 2             2023-09-04           2023-11-02  59 days            5
# 3             2023-11-03           2024-01-01  59 days            6
# 4             2024-01-02           2024-03-01  59 days            6

as does

test_df_period_counts <- slide_period_dfr(
  test_df,
  .i = test_df$dates,
  .period = "day",
  .f = period_count,
  .every = 60,
  .complete = FALSE,
  .origin = max(test_df$dates) +1
)

where the partial period and its .f operations are included.

Is this a bug, or does slide_period_dfr ignore the .complete argument?

Thank you for your time and attention!

DavisVaughan commented 3 months ago

Ok, I've looked into this a little to confirm that .complete is working as intended. With slide_period_*(), the .complete argument does work slightly differently than in the rest of slider, but let me try to explain with a simpler example.

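slide_period() works in two steps:

- Chunk the index .i into bins by .period (together with .every and .origin).
- Slide over those bins, applying .before, .after, and .complete to them.

So in the example below of looking at "the current month plus 1 month before it", we first:

- Chunk i by month, which assigns the six dates to bins 600, 600, 601, 601, 603, 603.
- Slide over those bins, looking at the current bin plus 1 bin before it.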

.complete is only taken into account during the second of those two steps. The way it works is that it asks the question: "Is it even technically possible to have any data in the current month bin AND the previous month bin?". In the result's 1st value, it is literally impossible to have anything in the "previous month bin" because bin 600 is the first one, so there is no previous month bin. That window is therefore considered "incomplete", and you get a NULL there when .complete = TRUE.

Note that this is not the case for the 2020-04 bin assigned to number 603. Even though there is no data in the 2020-03 bin, there technically could be because we've seen bins 600 and 601 before it, so bin 602 could theoretically exist and give us a complete window. The way this is implemented ensures that you only see NULL incomplete bins at the front (or back, if using .after) of the result set, and not interspersed randomly throughout it. Again, in this case it is only returning NULL if it is technically impossible to have a complete bin based on the way the arguments are specified, regardless of the data.

library(slider)

i <- as.Date(c(
  "2020-01-01", "2020-01-05",
  "2020-02-02", "2020-02-04",
  "2020-04-01", "2020-04-07"
))

# "the current month, and 1 month before it"
slide_period(i, i, "month", identity, .before = 1)
#> [[1]]
#> [1] "2020-01-01" "2020-01-05"
#> 
#> [[2]]
#> [1] "2020-01-01" "2020-01-05" "2020-02-02" "2020-02-04"
#> 
#> [[3]]
#> [1] "2020-04-01" "2020-04-07"

# it is literally impossible for the first group to be "complete".
slide_period(i, i, "month", identity, .before = 1, .complete = TRUE)
#> [[1]]
#> NULL
#> 
#> [[2]]
#> [1] "2020-01-01" "2020-01-05" "2020-02-02" "2020-02-04"
#> 
#> [[3]]
#> [1] "2020-04-01" "2020-04-07"

# a good way to check what's happening is to look at the result of warp_distance(),
# used under the hood. this "chunks" the `i`ndex by `period`, and then `.before`,
# `.after`, and `.complete` are applied to this result
warp::warp_distance(i, "month")
#> [1] 600 600 601 601 603 603
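
To make the two steps concrete, here is a rough sketch that emulates them by hand: it chunks the index with warp_distance() and then slides over the bin numbers with slide_index(). This is only an illustration of the idea, not slider's actual implementation, and the names chunks, bin_ids, manual, direct, and back are made up for the example. The last call uses .after instead of .before, where the incomplete window should show up at the back of the result rather than the front.

```r
library(slider)

i <- as.Date(c(
  "2020-01-01", "2020-01-05",
  "2020-02-02", "2020-02-04",
  "2020-04-01", "2020-04-07"
))

# Step 1: assign each date to a month bin (there is no bin 602 in this data)
bins <- warp::warp_distance(i, "month")

# Step 2 (emulated): slide over the bins, using the bin number as the index
chunks <- unname(split(i, bins))
bin_ids <- unique(bins)
manual <- slide_index(
  chunks,
  bin_ids,
  function(window) do.call(c, window),
  .before = 1,
  .complete = TRUE
)

# Compare the hand-rolled version against slide_period() itself
direct <- slide_period(i, i, "month", identity, .before = 1, .complete = TRUE)
same <- identical(lapply(manual, format), lapply(direct, format))
same

# With `.after`, it is the last bin that cannot possibly have a full window
back <- slide_period(i, i, "month", identity, .after = 1, .complete = TRUE)
back
```

If the emulation is faithful, same is TRUE, and back ends with a NULL for the 2020-04 bin because no later bin exists for it to look ahead to.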

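So in the case of your example, it will:

- Chunk the dates into 60-day bins counted from .origin (.period = "day", .every = 60).
- Slide over those bins using .before, .after, and .complete.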

Since .before and .after are both 0, you've requested it to slide over "just the current 60 day bin". It is technically possible for that to contain a full window of data even in the first result, so it returns the same thing regardless of .complete = TRUE/FALSE.
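
As a quick check of that reasoning, the sketch below rebuilds a comparable index (200 consecutive days, in ascending order) and counts the days in each window. The helper count_days() and its arguments are hypothetical names for this example only, and the data is simplified to a bare date vector rather than your test_df.

```r
library(slider)

# 200 consecutive days in ascending order, as slider requires for `.i`
dates <- as.Date("2023-12-31") - 199:0
origin <- max(dates) + 1

# Step 1: chunk into 60-day bins counted from `origin`
bins <- warp::warp_distance(dates, "day", every = 60, origin = origin)
table(bins)

# Step 2: slide over the bins and count the days in each window
count_days <- function(complete, before = 0) {
  slide_period(
    dates, dates, "day", length,
    .every = 60, .origin = origin,
    .before = before, .complete = complete
  )
}

# With `.before = 0`, a one-bin window can always be complete,
# so `.complete` makes no difference
identical(count_days(TRUE), count_days(FALSE))

# With `.before = 1`, the first bin has no possible predecessor,
# so its window is incomplete
count_days(TRUE, before = 1)[[1]]
count_days(FALSE, before = 1)[[1]]
```

The first bin holds only 20 days, yet it still counts as complete when .before = 0, because completeness is judged on whether the bins in the window can exist, not on how much data falls in them.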


I should probably create a small vignette that talks about this example in more detail, as this is definitely one of the more complicated parts of slider. I will leave this issue open to remind myself to do that, but it is working as intended.