r-lib / slider

Sliding Window Functions
https://slider.r-lib.org
Other
296 stars 12 forks source link

How to return the timestamp at the center of sliding window #161

Closed xgirouxb closed 3 years ago

xgirouxb commented 3 years ago

Hello, thanks for this marvelous package. I'm wondering if there is something obvious I've missed the in the documentation, but I am unnable to return the timestamp for the observation at the center of the window in slide_period.

# Import packages
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(slider)

# Set seed
set.seed(42)

# Data
tbl <- tibble::tibble(
  datestamp = lubridate::as_datetime(
    x = c(
      "2004-04-04 09:02:31 UTC", "2004-04-04 16:01:06 UTC",
      "2004-04-04 23:01:25 UTC", "2004-04-05 06:00:37 UTC",
      "2004-04-05 13:00:37 UTC", "2004-04-05 20:01:07 UTC",
      "2004-04-06 03:00:38 UTC", "2004-04-06 10:00:38 UTC",
      "2004-04-06 17:00:41 UTC", "2004-04-07 00:01:06 UTC",
      "2004-04-07 07:01:07 UTC", "2004-04-07 14:00:38 UTC"
    )
  ),
  value = rnorm(n = 12)
)

# Helper function
get_mean_value <- function(data){
  dplyr::summarise(
    .data = data,
    datestamp = median(datestamp),
    mean_value = mean(value),
  )
}

# New data with mean value in moving window
tbl_mean <- slider::slide_period_dfr(
  .x = tbl,
  .i = tbl$datestamp,
  .period = "hour",
  .f = get_mean_value,
  .before = 24,
  .after = 24
)

# The datestamps in tbl_mean are not the original
identical(tbl$datestamp, tbl_mean$datestamp)
#> [1] FALSE

Created on 2021-09-22 by the reprex package (v2.0.0)

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.0.3 (2020-10-10) #> os Linux Mint 20 #> system x86_64, linux-gnu #> ui X11 #> language en_CA:en #> collate en_CA.UTF-8 #> ctype en_CA.UTF-8 #> tz America/Vancouver #> date 2021-09-22 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2) #> backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.3) #> blob 1.2.1 2020-01-20 [1] CRAN (R 4.0.2) #> cli 3.0.1 2021-07-17 [1] CRAN (R 4.0.3) #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.3) #> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3) #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3) #> dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.0.3) #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.0.3) #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2) #> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.0.3) #> fs 1.4.1 2020-04-04 [1] CRAN (R 4.0.2) #> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3) #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.3) #> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2) #> htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2) #> knitr 1.30 2020-09-22 [1] CRAN (R 4.0.3) #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.3) #> lubridate * 1.7.10 2021-02-26 [1] CRAN (R 4.0.3) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3) #> pillar 1.6.2 2021-07-29 [1] CRAN (R 4.0.3) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.0.3) #> Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.0.3) #> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.0.3) #> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.0.3) #> rmarkdown 2.5 2020-10-21 [1] CRAN (R 4.0.3) #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.3) #> slider * 0.1.5 2020-07-21 [1] CRAN (R 4.0.3) #> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.3) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2) #> styler 1.5.1 2021-07-13 [1] CRAN (R 4.0.3) #> tibble 3.1.4 2021-08-25 [1] CRAN (R 4.0.3) #> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.0.3) #> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.0.3) #> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.0.3) #> warp 0.2.0 2020-10-21 [1] CRAN (R 4.0.3) #> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3) #> xfun 0.19 2020-10-30 [1] CRAN (R 4.0.3) #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2) #> #> [1] /home/rangifer/R/x86_64-pc-linux-gnu-library/4.0 #> [2] /usr/local/lib/R/site-library #> [3] /usr/lib/R/site-library #> [4] /usr/lib/R/library ```
DavisVaughan commented 3 years ago

Is it possible you want something a little simpler, using slide_index_dbl()? That would allow you to line the resulting means up with the existing date-time column.

library(dplyr)
library(lubridate)
library(slider)

# Set seed
set.seed(42)

# Data
tbl <- tibble::tibble(
  datestamp = lubridate::as_datetime(
    x = c(
      "2004-04-04 09:02:31 UTC", "2004-04-04 16:01:06 UTC",
      "2004-04-04 23:01:25 UTC", "2004-04-05 06:00:37 UTC",
      "2004-04-05 13:00:37 UTC", "2004-04-05 20:01:07 UTC",
      "2004-04-06 03:00:38 UTC", "2004-04-06 10:00:38 UTC",
      "2004-04-06 17:00:41 UTC", "2004-04-07 00:01:06 UTC",
      "2004-04-07 07:01:07 UTC", "2004-04-07 14:00:38 UTC"
    )
  ),
  value = rnorm(n = 12)
)

# Take the current value, and create a window of:
# [current - 24hrs, current + 24hrs]
# and take the average of whatever falls in that window
tbl %>%
  mutate(
    mean = slide_index_dbl(
      .x = value, 
      .i = datestamp, 
      .f = mean, 
      .before = dhours(24), 
      .after = dhours(24)
    )
  )
#> # A tibble: 12 × 3
#>    datestamp             value  mean
#>    <dttm>                <dbl> <dbl>
#>  1 2004-04-04 09:02:31  1.37   0.451
#>  2 2004-04-04 16:01:06 -0.565  0.441
#>  3 2004-04-04 23:01:25  0.363  0.350
#>  4 2004-04-05 06:00:37  0.633  0.516
#>  5 2004-04-05 13:00:37  0.404  0.307
#>  6 2004-04-05 20:01:07 -0.106  0.676
#>  7 2004-04-06 03:00:38  1.51   0.615
#>  8 2004-04-06 10:00:38 -0.0947 0.711
#>  9 2004-04-06 17:00:41  2.02   0.980
#> 10 2004-04-07 00:01:06 -0.0627 1.16 
#> 11 2004-04-07 07:01:07  1.30   1.09 
#> 12 2004-04-07 14:00:38  2.29   1.39

slide_period() and its variants do something a little more complex. With .period = "hour", that first creates 1 hour blocks starting from the origin, which defaults to 1970-01-01 00:00:00 UTC here. So, it creates blocks of time like:

[1970-01-01 00:00:00, 1970-01-01 01:00:00), [1970-01-01 01:00:00, 1970-01-01 02:00:00)

Then it figures out which block .i falls in. For example, the first time, 2004-04-04 09:02:31, falls in the block of: [2004-04-04 09:00:00, 2004-04-04 10:00:00). Now that we have the "current" block, it also includes everything in the 24 previous blocks (for .before) and 24 following blocks (for .after). This gives us a total interval of [2004-04-03 09:00:00, 2004-04-05 10:00:00) to search in to find values that we should use to compute the first mean.

This is easiest to see if you use the index as .x too, and make the function identity() to just return the index:

slide_period(
  .x = tbl$datestamp,
  .i = tbl$datestamp,
  .period = "hour",
  .f = identity,
  .before = 24,
  .after = 24
)
#> [[1]]
#> [1] "2004-04-04 09:02:31 UTC" "2004-04-04 16:01:06 UTC"
#> [3] "2004-04-04 23:01:25 UTC" "2004-04-05 06:00:37 UTC"
#> 
#> [[2]]
#> [1] "2004-04-04 09:02:31 UTC" "2004-04-04 16:01:06 UTC"
#> [3] "2004-04-04 23:01:25 UTC" "2004-04-05 06:00:37 UTC"
#> [5] "2004-04-05 13:00:37 UTC"
#> 
#> [[3]]
#> [1] "2004-04-04 09:02:31 UTC" "2004-04-04 16:01:06 UTC"
#> [3] "2004-04-04 23:01:25 UTC" "2004-04-05 06:00:37 UTC"
#> [5] "2004-04-05 13:00:37 UTC" "2004-04-05 20:01:07 UTC"
#>
#> ...and more here
xgirouxb commented 3 years ago

Is it possible you want something a little simpler, using slide_index_dbl()?

Yes, you are absolutely right, that is indeed what I was looking for. I assumed that the "index" in slide_index() meant that the width of the sliding window was specified in number of observations before/after the current observation (which could be useful for regular time series). Inversely, I thought the function slide_period() meant that the window width was a specified duration before and after an observation.

Thank you so much for the explanation, I appreciate how hard naming functions can be and I now fully understand what you were aiming for. Sorry for the confusion!