tidyverse / lubridate

Make working with dates in R just that little bit easier
https://lubridate.tidyverse.org
GNU General Public License v3.0
724 stars 207 forks source link

round_date / floor_date / ceiling_date: week start when using a period as unit #1132

Closed TimBMK closed 11 months ago

TimBMK commented 11 months ago

The round_date()-functions (like ceiling_date() and floor_date()) documentation states that periods, such as weeks(1), can be provided as unit. This is great for use in higher level functions, where an arbitrary time period may be chosen to floor or ceiling a date. However, the results between unit = "weeks" and unit = weeks(1) differ. Specifically, unit = weeks(1) seems to set the week_start to Friday, and ignores the week_start argument. See reprex below.

Am I missing something here or is this a bug?

library(tidyverse)

example <-
  tibble(
    datetime = c(
      "2022-07-05 04:32:55 UTC",
      "2022-07-09 04:42:47 UTC",
      "2022-07-18 04:45:42 UTC"
    )
  )

# see how the results differ between the two unit arguments
example %>% mutate(
  period_unit = ceiling_date(as.Date(datetime), unit = weeks(1)),
  period_str = ceiling_date(as.Date(datetime), unit = "weeks")
)

# weeks(1) also ignores the week_start argument
example %>% mutate(
  period_unit = ceiling_date(as.Date(datetime), unit = weeks(1), week_start = 3),
  period_str = ceiling_date(as.Date(datetime), unit = "weeks", week_start = 3)
)
> sessionInfo()
R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_1.0.0   stringr_1.5.0   dplyr_1.1.2     purrr_1.0.1     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.1   tidyverse_2.0.0 lubridate_1.9.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10        quanteda_3.2.2     pillar_1.9.0       compiler_4.2.2     tools_4.2.2        stopwords_2.3      digest_0.6.31      bit_4.0.5         
 [9] gtable_0.3.0       timechange_0.2.0   lifecycle_1.0.3    lattice_0.20-45    pkgconfig_2.0.3    rlang_1.1.1        Matrix_1.5-1       fastmatch_1.1-3   
[17] cli_3.6.1          rstudioapi_0.14    parallel_4.2.2     withr_2.5.0        hms_1.1.3          generics_0.1.3     vctrs_0.6.2        globals_0.16.2    
[25] bit64_4.0.5        grid_4.2.2         tidyselect_1.2.0   glue_1.6.2         listenv_0.9.0      R6_2.5.1           fansi_1.0.4        parallelly_1.35.0 
[33] vroom_1.6.3        tzdb_0.3.0         magrittr_2.0.3     scales_1.2.0       codetools_0.2-19   colorspace_2.0-3   future_1.32.0      utf8_1.2.3        
[41] stringi_1.7.12     munsell_0.5.0      RcppParallel_5.1.5 crayon_1.5.2  
vspinu commented 11 months ago

This is because week(1) is internally a "7d 0H 0M 0S"

> weeks(1)
[1] "7d 0H 0M 0S"
> str(weeks(1))
Formal class 'Period' [package "lubridate"] with 6 slots
  ..@ .Data : num 0
  ..@ year  : num 0
  ..@ month : num 0
  ..@ day   : num 7
  ..@ hour  : num 0
  ..@ minute: num 0

So you get a ceiling and flooring to the multiple of 7 days from the begining of the month. I agree that this is a bit confusing but I am afraid there is no solution for this given the current internal representation of the periods.

I am clarifying the documentation regarding exactly this point.

TimBMK commented 10 months ago

Thanks for the clarification! Yes, it makes sense that it starts from the first day of the month. It's a little confusing when you expect weeks(1) to behave like an equivalent of the "weeks" unit, especially considering the _weekstart argument that is silently getting ignored in this case.

In practice, I've used the weeks() unit specifically to avoid the fiddly month start dates, e.g. by replacing months(3) with weeks(12) to get more consistent timeframe slices over longer periods. So I'm a little surprised weeks() is tied to the monthly start date here, but that might just be me misunderstanding the exact nature of the weeks() period.

Anyways, thank you for this great package. It certainly makes life a lot easier when dealing with dates!