Support relative intervals with both beginning and end

larsentom commented 1 year ago

Scenario: In clinical data we often want to look at values in a specific time range relative to the current value - for example, the most recent value between 7 days and 14 days ago.

Feature request: A method to allow specification of a relative interval, e.g. with .interval_relative_start / .interval_relative_stop (or some better-named parameters).

# Set up demo data  - - - - -

library(tidyverse)

set.seed(31415)

lab_results <- tibble(
  patient_id = sample(seq(1, 100), size = 10000, replace = TRUE),
  lab_date = lubridate::dmy("1 Jan 2020") + days(sample(seq(1, 365*3), size = 10000, replace = TRUE)),
  Hb_g_dL = runif(n = 10000, min = 10, max = 20)
) %>% distinct() %>% arrange(lab_date)

# How this could work - - - - -

lab_results %>%
  arrange() %>%
  group_by(patient_id) %>%
  mutate(last_Hb_value_from_7_to_14_days_ago = 
    slide_index(
      .x = cur_data(),
      .i = cur_data()$i,
      .f = function(data) { last(data$Hb_g_dL) },
      .interval_relative_start = -14, # NEW
      .interval_relative_stop = -7,   # NEW
      .complete = TRUE
    )) %>%
  ungroup()

DavisVaughan commented 1 year ago

I'm fairly certain the .before and .after arguments do what you want here.

You typically use .before to extend the window back to N days "before" the current index value, and .after to extend the window forward to N days "after" the current index value. However, you are absolutely permitted to use a negative .after value, i.e.:

[14 days before, -7 days after] == [14 days before, 7 days before]

library(tidyverse)
library(slider)
library(lubridate)

set.seed(31415)

lab_results <- tibble(
  patient_id = sample(seq(1, 100), size = 10000, replace = TRUE),
  lab_date = dmy("1 Jan 2020") + days(sample(seq(1, 365*3), size = 10000, replace = TRUE)),
  Hb_g_dL = runif(n = 10000, min = 10, max = 20)
) %>% 
  distinct() %>%
  arrange(patient_id, lab_date)

lab_results
#> # A tibble: 10,000 × 3
#>    patient_id lab_date   Hb_g_dL
#>         <int> <date>       <dbl>
#>  1          1 2020-01-15    14.0
#>  2          1 2020-01-18    18.0
#>  3          1 2020-01-23    17.8
#>  4          1 2020-02-02    16.6
#>  5          1 2020-02-03    10.7
#>  6          1 2020-02-18    13.0
#>  7          1 2020-02-23    11.2
#>  8          1 2020-02-28    19.5
#>  9          1 2020-03-02    10.2
#> 10          1 2020-03-10    16.8
#> # … with 9,990 more rows
#> # ℹ Use `print(n = ...)` to see more rows

lab_results %>%
  group_by(patient_id) %>%
  mutate(
    last_Hb_value_from_7_to_14_days_ago = 
      slide_index_dbl(
        .x = Hb_g_dL,
        .i = lab_date,
        .f = ~ last(.x),
        .before = 14, # from 14 days before the current date
        .after = -7, # to 7 days before the current date ("-7 after the current date")
        .complete = TRUE
      )
  ) %>%
  ungroup()
#> # A tibble: 10,000 × 4
#>    patient_id lab_date   Hb_g_dL last_Hb_value_from_7_to_14_days_ago
#>         <int> <date>       <dbl>                               <dbl>
#>  1          1 2020-01-15    14.0                                NA  
#>  2          1 2020-01-18    18.0                                NA  
#>  3          1 2020-01-23    17.8                                NA  
#>  4          1 2020-02-02    16.6                                17.8
#>  5          1 2020-02-03    10.7                                17.8
#>  6          1 2020-02-18    13.0                                NA  
#>  7          1 2020-02-23    11.2                                NA  
#>  8          1 2020-02-28    19.5                                13.0
#>  9          1 2020-03-02    10.2                                11.2
#> 10          1 2020-03-10    16.8                                10.2
#> # … with 9,990 more rows
#> # ℹ Use `print(n = ...)` to see more rows

I'm fairly confident this is what you want, so im going to close. But feel free to add another comment with any follow up comments

larsentom commented 1 year ago

Ah! I think I see what you mean!

I had initially tried .after = -14, .before = 7 (i.e., >= -14 days ago, <= -7 days ago) - which led to a range error message.

This way uses .before = 14, .after = -7 and seems to work ok - will experiment a bit more.

I am trying to understand the reason for my confusion.

I think I assumed something like:

".after is >= (...) and .before is <= -(...)"

whereas it seems like .before and .after are like:

".after is <= (...) and .before is >= -(...)"

Does this mean that e.g. looking after the event (7-14 days ahead) would mean using something like .before = -7, .after = 14?

If so, it seems that it would support the kind of range-behaviour required! I only wonder though if there would be a way to make this less ambiguous. For example, it seems like specifying '.after = 7' without any .before specification would mean 'pick all the values from (the current value) up to (7 days from the current value)' - a 7-day period. But intuitively, I would expect this to instead 'pick all the available values from 7 days after the current value' (from 7 days ahead to the last available value). Perhaps it's just me?

DavisVaughan commented 1 year ago

I think you should think of it as creating windows that look like:

[.i - .before, .i + .after]

It does exactly that in the codebase somewhere, and then it requires that .i - .before <= .i + .after to have a sensible window (which led to the errors you saw)

Does this mean that e.g. looking after the event (7-14 days ahead) would mean using something like .before = -7, .after = 14?

Yes exactly

Regarding your confusion about .before/.after, I'm hopeful that it is just you 😄 I think that the documentation for .before and .after in ?slide_index is pretty good at conveying the correct meaning (I'm biased of course, but still). It is also meant to map almost exactly to how it works in SQL with Window functions. i.e. this:

SELECT
  shop,
  date,
  revenue_amount,
  AVG(revenue_amount) OVER (
    PARTITION BY shop
    ORDER BY date ASC
    RANGE BETWEEN INTERVAL '1' DAY PRECEDING AND CURRENT ROW
  ) AS moving_avg
FROM revenue_per_shop;

is very close to this grouped by shop:

slide_index_dbl(revenue_amount, date, mean, .before = days(1), .after = 0)

r-lib / slider

Support relative intervals with both beginning and end #171