r-lib / clock

A Date-Time Library for R
https://clock.r-lib.org
Other
97 stars 4 forks source link

Precision on `date_group()` is clamped at max value for unit instead of performing numerical conversion when n > max #359

Open matiasandina opened 10 months ago

matiasandina commented 10 months ago

I am not sure if this is by design or not, but I really like the behavior of date_group() and it just exploded in my face when trying to use it with more than 60 minutes. Here's a reproducible example:

now <- Sys.time()
dts <- clock::date_seq(from = now, to =  now + lubridate::hours(5), by = 600)
tibble::tibble(dts = dts,
               min_60 = clock::date_group(dts, precision = "minute", n = 60),
               # gets clamped at 1 hour mark
               min_120 = clock::date_group(dts, precision = "minute", n = 120),
               h_2 = clock::date_group(dts, precision = "hour", n = 2)
)

# A tibble: 31 × 4
   dts                 min_60              min_120             h_2                
   <dttm>              <dttm>              <dttm>              <dttm>             
 1 2023-08-11 16:38:29 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 2 2023-08-11 16:48:29 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 3 2023-08-11 16:58:29 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 4 2023-08-11 17:08:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00
 5 2023-08-11 17:18:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00
 6 2023-08-11 17:28:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00
 7 2023-08-11 17:38:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00
 8 2023-08-11 17:48:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00
 9 2023-08-11 17:58:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00
10 2023-08-11 18:08:29 2023-08-11 18:00:00 2023-08-11 18:00:00 2023-08-11 18:00:00
# ℹ 21 more rows
# ℹ Use `print(n = ...)` to see more rows

It would be nice for clock to automatically perform the conversion. If difficult to change, I think there should be some sort of verbose warning to the user, since the grouping gets performed silently and gives the illusion of it working properly.

This gets properly handled as I would expect with date_floor (conversion being made), so I'm not sure if date_group() is doing what it is supposed to be doing and I should be using date_floor instead.

tibble::tibble(dts = dts,
               min_60 = clock::date_group(dts, precision = "minute", n = 60),
               # gets clamped at 1 hour mark
               min_120 = clock::date_group(dts, precision = "minute", n = 120),
               h_2 = clock::date_group(dts, precision = "hour", n = 2),
               floor = clock::date_floor(dts, precision = "minute", n = 120)
)

# A tibble: 31 × 5
   dts                 min_60              min_120             h_2                 floor              
   <dttm>              <dttm>              <dttm>              <dttm>              <dttm>             
 1 2023-08-11 16:38:29 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 2 2023-08-11 16:48:29 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 3 2023-08-11 16:58:29 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 4 2023-08-11 17:08:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 5 2023-08-11 17:18:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 6 2023-08-11 17:28:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 7 2023-08-11 17:38:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 8 2023-08-11 17:48:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 9 2023-08-11 17:58:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
10 2023-08-11 18:08:29 2023-08-11 18:00:00 2023-08-11 18:00:00 2023-08-11 18:00:00 2023-08-11 18:00:00
# ℹ 21 more rows
# ℹ Use `print(n = ...)` to see more rows
DavisVaughan commented 10 months ago

The point of date_group() when used at "minute" precision is to group by n minutes within the current hour.

i.e. the "counter" which decides how to break up the groups resets at every hour.

Here is one of the Examples that tries to demonstrate this

x <- as.POSIXct("2019-01-01", "America/New_York")
x <- add_days(x, -3:5)

# Group by 2 days of the current month.
# Note that this resets at the beginning of the month, creating day groups
# of [29, 30] [31] [01, 02] [03, 04].
date_group(x, "day", n = 2)

If you just want to create buckets of 120 minutes anchored from an initial origin of 1970-01-01 00:00:00 then date_floor() is the right function for you.

date_floor() and date_group() work via very different algorithms for very different purposes https://clock.r-lib.org/articles/clock.html#grouping-rounding-and-shifting

matiasandina commented 10 months ago

Got it. My package function is a glorified wrapper of date_group() that basically bins before processing:

  data <- data %>%
    dplyr::mutate(bin = clock::date_group({{time_col}}, n = n, precision = precision))

I guess they should not change from minutes to hours like it's no big deal hoping for internal conversion. I am not sure whether edge cases should favor date_group() or floor_date() and it's a bit tough to think about the different use cases for the binning function (i.e., what do they expect when binning? should they try to use the same function to bin minutes, hours, and days?). It's likely that the floor_date() is the most intuitive behavior here, but I'm a bit uncertain because the design choice was to give freedom to the user (they can select whatever bin with one funtion). Unfortunately, they can come up with bins that produce different behavior:

x <- as.POSIXct("2019-01-01 23:00:00", "America/New_York")
x <- clock::add_minutes(x, seq(-119, 121, 18))

x
 [1] "2019-01-01 21:01:00 EST" "2019-01-01 21:19:00 EST" "2019-01-01 21:37:00 EST" "2019-01-01 21:55:00 EST"
 [5] "2019-01-01 22:13:00 EST" "2019-01-01 22:31:00 EST" "2019-01-01 22:49:00 EST" "2019-01-01 23:07:00 EST"
 [9] "2019-01-01 23:25:00 EST" "2019-01-01 23:43:00 EST" "2019-01-02 00:01:00 EST" "2019-01-02 00:19:00 EST"
[13] "2019-01-02 00:37:00 EST" "2019-01-02 00:55:00 EST"

# using `n` = 25 because 25 is not a divisor of 60 
# intuitively weird that the first block is 20:45 and not 21:00
clock::date_floor(x, "minute", n = 25)
 [1] "2019-01-01 20:45:00 EST" "2019-01-01 21:10:00 EST" "2019-01-01 21:35:00 EST" "2019-01-01 21:35:00 EST"
 [5] "2019-01-01 22:00:00 EST" "2019-01-01 22:25:00 EST" "2019-01-01 22:25:00 EST" "2019-01-01 22:50:00 EST"
 [9] "2019-01-01 23:15:00 EST" "2019-01-01 23:40:00 EST" "2019-01-01 23:40:00 EST" "2019-01-02 00:05:00 EST"
[13] "2019-01-02 00:30:00 EST" "2019-01-02 00:55:00 EST"

# This is behavior might make more sense in this particular case
clock::date_group(x, "minute", n = 25)
 [1] "2019-01-01 21:00:00 EST" "2019-01-01 21:00:00 EST" "2019-01-01 21:25:00 EST" "2019-01-01 21:50:00 EST"
 [5] "2019-01-01 22:00:00 EST" "2019-01-01 22:25:00 EST" "2019-01-01 22:25:00 EST" "2019-01-01 23:00:00 EST"
 [9] "2019-01-01 23:25:00 EST" "2019-01-01 23:25:00 EST" "2019-01-02 00:00:00 EST" "2019-01-02 00:00:00 EST"
[13] "2019-01-02 00:25:00 EST" "2019-01-02 00:50:00 EST"