r-lib / slider

Sliding Window Functions
https://slider.r-lib.org

Error: Endpoints generated by `.before` cannot be `NA`. #129

Closed zhongwei-yao closed 3 years ago

zhongwei-yao commented 3 years ago

Thank you for the awesome package! I ran into an error and have no idea how to fix it.

library(lubridate)
library(slider)

x <- as.Date(c("2000-02-29", "2000-03-01", "2000-03-02", "2000-03-03", "2000-03-06", "2000-03-07", "2000-03-08", "2000-03-09", "2000-03-10", "2000-03-13", "2000-03-14", "2000-03-15", "2000-03-16", "2000-03-17", "2000-03-20", "2000-03-21", "2000-03-22", "2000-03-23", "2000-03-24", "2000-03-27", "2000-03-28", "2000-03-29", "2000-03-30"))

# Works when the last date, "2000-03-30", is excluded
slide_index(x[1:22], x[1:22], ~.x, .before = months(1), .complete = FALSE)

# Errors when all 23 dates are used
slide_index(x, x, ~.x, .before = months(1), .complete = FALSE)

DavisVaughan commented 3 years ago

You have run into a very common lubridate issue, which I call invalid dates.

You can see that for 2000-03-30 - months(1), an invalid date of 2000-02-30 would be generated. Because this doesn't exist, lubridate defaults to returning NA.

library(lubridate)
library(slider)

x <- as.Date(c(
  "2000-02-29", "2000-03-01", "2000-03-02", "2000-03-03", "2000-03-06", 
  "2000-03-07", "2000-03-08", "2000-03-09", "2000-03-10", "2000-03-13",
  "2000-03-14", "2000-03-15", "2000-03-16", "2000-03-17", "2000-03-20",
  "2000-03-21", "2000-03-22", "2000-03-23", "2000-03-24", "2000-03-27",
  "2000-03-28", "2000-03-29", "2000-03-30"
))

x_tail <- tail(x)

x_tail
#> [1] "2000-03-23" "2000-03-24" "2000-03-27" "2000-03-28" "2000-03-29"
#> [6] "2000-03-30"

# Generates NA values on invalid dates
x_tail - months(1)
#> [1] "2000-02-23" "2000-02-24" "2000-02-27" "2000-02-28" "2000-02-29"
#> [6] NA

This is often a frustrating behavior that can happen silently. It has come up many times in lubridate issues! For this specific issue, a lubridate solution is to use %m-% rather than -. For invalid dates like this one, it will instead roll backwards to the previous valid date (2000-02-29, the end of that month). This is reasonable behavior. You end up with two 2000-02-29 values, but that is the price we have to pay for dealing with irregular calendrical data.

# Handles invalid dates by choosing the previous valid date
x_tail %m-% months(1)
#> [1] "2000-02-23" "2000-02-24" "2000-02-27" "2000-02-28" "2000-02-29"
#> [6] "2000-02-29"
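Since the NA appears silently, one option is to check for it before handing the index to slider. A minimal sketch on a few toy dates (not the full vector above), assuming only that lubridate is loaded:

```r
library(lubridate)

# Toy dates around the end of March 2000 (a leap year)
d <- as.Date(c("2000-03-28", "2000-03-29", "2000-03-30", "2000-03-31"))

# Plain `-` silently returns NA where the target day does not exist
naive <- d - months(1)
bad <- which(is.na(naive))
bad
d[bad]

# `%m-%` rolls those cases back to the last valid day of the month
rolled <- d %m-% months(1)
anyNA(rolled)
```

Here `bad` picks out 2000-03-30 and 2000-03-31, since February 2000 has no 30th or 31st.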

Ideally you'd be able to do:

slide_index(x, x, ~.x, .before = ~.x %m-% months(1), .complete = FALSE)

but I haven't added anonymous function .before values to slider yet. I plan to eventually.
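For anyone reading this later: newer slider releases appear to accept a function or one-sided formula for .before. The 0.3.0 cutoff used below is an assumption from memory, so check the changelog for your installed version. A hedged sketch on toy dates, guarded by a version check so it is simply skipped on older installs:

```r
library(lubridate)
library(slider)

# Toy index; the last date would hit the invalid 2000-02-30 with plain `-`
d <- as.Date(c("2000-02-29", "2000-03-15", "2000-03-29", "2000-03-30"))

# Assumption: function-valued `.before` needs slider >= 0.3.0
if (packageVersion("slider") >= "0.3.0") {
  # The formula is applied to the index to compute each window's start
  res <- slide_index(d, d, ~.x, .before = ~ .x %m-% months(1))
  print(lengths(res))
}
```

On older versions the hop_index() approach is the way to get the same windows.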

Instead you can get fairly close to that by manually specifying starts/stops with hop_index(). You can create the .starts with %m-%. (There is no .complete argument here, so it isn't exactly the same.)

# Manually generate and specify start/stop values
starts <- x %m-% months(1)
stops <- x

results <- hop_index(
  .x = x, 
  .i = x, 
  .starts = starts, 
  .stops = stops, 
  .f = ~.x
)

tail(results, 3)
#> [[1]]
#>  [1] "2000-02-29" "2000-03-01" "2000-03-02" "2000-03-03" "2000-03-06"
#>  [6] "2000-03-07" "2000-03-08" "2000-03-09" "2000-03-10" "2000-03-13"
#> [11] "2000-03-14" "2000-03-15" "2000-03-16" "2000-03-17" "2000-03-20"
#> [16] "2000-03-21" "2000-03-22" "2000-03-23" "2000-03-24" "2000-03-27"
#> [21] "2000-03-28"
#> 
#> [[2]]
#>  [1] "2000-02-29" "2000-03-01" "2000-03-02" "2000-03-03" "2000-03-06"
#>  [6] "2000-03-07" "2000-03-08" "2000-03-09" "2000-03-10" "2000-03-13"
#> [11] "2000-03-14" "2000-03-15" "2000-03-16" "2000-03-17" "2000-03-20"
#> [16] "2000-03-21" "2000-03-22" "2000-03-23" "2000-03-24" "2000-03-27"
#> [21] "2000-03-28" "2000-03-29"
#> 
#> [[3]]
#>  [1] "2000-02-29" "2000-03-01" "2000-03-02" "2000-03-03" "2000-03-06"
#>  [6] "2000-03-07" "2000-03-08" "2000-03-09" "2000-03-10" "2000-03-13"
#> [11] "2000-03-14" "2000-03-15" "2000-03-16" "2000-03-17" "2000-03-20"
#> [16] "2000-03-21" "2000-03-22" "2000-03-23" "2000-03-24" "2000-03-27"
#> [21] "2000-03-28" "2000-03-29" "2000-03-30"
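
If each window reduces to a single value, hop_index_vec() returns an atomic vector instead of a list. A small sketch on toy dates (the `length` summary is only for illustration), counting how many observations fall in each trailing one-month window:

```r
library(lubridate)
library(slider)

d <- as.Date(c("2000-02-29", "2000-03-15", "2000-03-29", "2000-03-30"))

# Window starts, rolled back to a valid date where needed
starts <- d %m-% months(1)

# One integer per window: the number of dates in [start, stop]
counts <- hop_index_vec(
  .x = d,
  .i = d,
  .starts = starts,
  .stops = d,
  .f = length,
  .ptype = integer()
)
counts
```

The last two windows share the same start, 2000-02-29, because 2000-03-30 was rolled back to it.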
DavisVaughan commented 3 years ago

I'll also add that I've been working on a new package called clock, and one of its goals is to make handling invalid dates a little less frustrating by being more verbose about when they occur, and by giving you more tools to handle them. It isn't on CRAN yet, but here is a teaser:

You immediately get an error as soon as you hit invalid date issues when using clock::add_months():

library(clock)

x <- as.Date(c(
  "2000-02-29", "2000-03-01", "2000-03-02", "2000-03-03", "2000-03-06", 
  "2000-03-07", "2000-03-08", "2000-03-09", "2000-03-10", "2000-03-13",
  "2000-03-14", "2000-03-15", "2000-03-16", "2000-03-17", "2000-03-20",
  "2000-03-21", "2000-03-22", "2000-03-23", "2000-03-24", "2000-03-27",
  "2000-03-28", "2000-03-29", "2000-03-30"
))

x_tail <- tail(x)

x_tail
#> [1] "2000-03-23" "2000-03-24" "2000-03-27" "2000-03-28" "2000-03-29"
#> [6] "2000-03-30"

add_months(x_tail, -1)
#> Error: Invalid date found at location 6. Resolve invalid date issues by specifying the `invalid` argument.

You have multiple ways to resolve these invalid dates by using the invalid argument. Here are a few of the many options:

# Previous valid moment in time (Sort of like `%m-%`)
add_months(x_tail, -1, invalid = "previous")
#> [1] "2000-02-23" "2000-02-24" "2000-02-27" "2000-02-28" "2000-02-29"
#> [6] "2000-02-29"

# Next valid moment in time
add_months(x_tail, -1, invalid = "next")
#> [1] "2000-02-23" "2000-02-24" "2000-02-27" "2000-02-28" "2000-02-29"
#> [6] "2000-03-01"

# Lubridate `- months(1)` behavior
add_months(x_tail, -1, invalid = "NA")
#> [1] "2000-02-23" "2000-02-24" "2000-02-27" "2000-02-28" "2000-02-29"
#> [6] NA

This is all part of the high level API for clock. There are also lower level types, such as year-month-day, that have the powerful ability of being able to represent these invalid dates directly.

ymd <- as_year_month_day(x_tail)

ymd
#> <year_month_day<day>[6]>
#> [1] "2000-03-23" "2000-03-24" "2000-03-27" "2000-03-28" "2000-03-29"
#> [6] "2000-03-30"

ymd_invalid <- add_months(ymd, -1)

# 1 invalid date, 2000-02-30
ymd_invalid
#> <year_month_day<day>[invalid=1][6]>
#> [1] "2000-02-23" "2000-02-24" "2000-02-27" "2000-02-28" "2000-02-29"
#> [6] "2000-02-30"

At this point you could resolve those invalid dates with invalid_resolve(invalid = ), which has the same options as where I specified invalid in add_months(), or you could leave it alone if you are going to be doing additional manipulations on that year-month-day type that might resolve the invalid dates automatically.
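
As a rough sketch of that workflow (building the year-month-day values directly with year_month_day() rather than from the dates above), you can detect the invalid entries and then resolve them:

```r
library(clock)

# 2000-02-30 does not exist, but year-month-day can still hold it
ymd <- year_month_day(2000L, 2L, 28:30)

invalid_detect(ymd)  # flags which elements are invalid
invalid_count(ymd)   # how many there are

# Resolve to the previous or next valid day, then convert back to Date
prev <- as.Date(invalid_resolve(ymd, invalid = "previous"))
nxt <- as.Date(invalid_resolve(ymd, invalid = "next"))
prev
nxt
```

Only the third element changes: it becomes 2000-02-29 with "previous" and 2000-03-01 with "next".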

alistaire47 commented 2 years ago

(I know this is closed, but I just ran into this error and don't know whether anything is wrong, per se.)

I got this error when using the vic_elec dataset from tsibbledata. I'm not precisely sure why--the timestamps are clean and in order. I managed to solve it by converting the time zone to UTC instead of "Australia/Melbourne", so maybe it has to do with leap years or legal changes to the time or something?

Odd, anyway. Maybe more explanation in the error would help users?

library(tidyverse)

table(diff(tsibbledata::vic_elec$Time), useNA = 'ifany')
#> 
#>    30 
#> 52607

tsibbledata::vic_elec %>%
    mutate(
        demand_1d_mean = slider::slide_index_dbl(
            .x = Demand,
            .i = Time,
            .f = mean, 
            .before = lubridate::days(1)
        )
    )
#> Error: Problem with `mutate()` column `demand_1d_mean`.
#> i `demand_1d_mean = slider::slide_index_dbl(.x = Demand, .i = Time, .f = mean, .before = lubridate::days(1))`.
#> x Endpoints generated by `.before` cannot be `NA`.
#> i They are `NA` at locations: 13493, 13494, 30965, 30966,....

lubridate::tz(tsibbledata::vic_elec$Time)
#> [1] "Australia/Melbourne"

tsibbledata::vic_elec %>%
    select(-Temperature, -Holiday, -Date) %>%
    mutate(
        utc_time = lubridate::with_tz(Time, 'UTC'),
        demand_1d_mean = slider::slide_index_dbl(
            .x = Demand,
            .i = utc_time,
            .f = mean, 
            .before = lubridate::days(1)
        )
    )
#> # A tibble: 52,608 x 4
#>    Time                Demand utc_time            demand_1d_mean
#>    <dttm>               <dbl> <dttm>                       <dbl>
#>  1 2012-01-01 00:00:00  4383. 2011-12-31 13:00:00          4383.
#>  2 2012-01-01 00:30:00  4263. 2011-12-31 13:30:00          4323.
#>  3 2012-01-01 01:00:00  4049. 2011-12-31 14:00:00          4232.
#>  4 2012-01-01 01:30:00  3878. 2011-12-31 14:30:00          4143.
#>  5 2012-01-01 02:00:00  4036. 2011-12-31 15:00:00          4122.
#>  6 2012-01-01 02:30:00  3866. 2011-12-31 15:30:00          4079.
#>  7 2012-01-01 03:00:00  3694. 2011-12-31 16:00:00          4024.
#>  8 2012-01-01 03:30:00  3562. 2011-12-31 16:30:00          3966.
#>  9 2012-01-01 04:00:00  3433. 2011-12-31 17:00:00          3907.
#> 10 2012-01-01 04:30:00  3359. 2011-12-31 17:30:00          3852.
#> # ... with 52,598 more rows

DavisVaughan commented 2 years ago

It has to do with daylight saving time, see this example:

library(tsibbledata)
library(lubridate)

time <- vic_elec$Time

time[13493]
#> [1] "2012-10-08 02:00:00 AEDT"

# This returns NA silently
time[13493] - days(1)
#> [1] NA

# Going back 2 days works
time[13493] - days(2)
#> [1] "2012-10-06 02:00:00 AEST"

# There is a DST gap on 2012-10-07, where clocks went from
# 01:59:59 -> 03:00:00, skipping the 2 o'clock hour entirely
right_before_gap <- time[13493] - (days(1) + seconds(1))
right_before_gap
#> [1] "2012-10-07 01:59:59 AEST"
right_before_gap + 1
#> [1] "2012-10-07 03:00:00 AEDT"

# So the theoretical time of "2012-10-07 02:00:00" doesn't exist
# in the Melbourne time zone, and you silently get an NA
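
To see every position that will trip this up before calling slider, you can look for NA in the shifted index directly. A sketch using a made-up half-hourly sequence in the same time zone instead of the real vic_elec data:

```r
library(lubridate)

# Half-hourly timestamps spanning the 2012-10-07 DST gap in Melbourne
melb <- "Australia/Melbourne"
idx <- seq(
  ymd_hms("2012-10-07 00:00:00", tz = melb),
  ymd_hms("2012-10-08 04:00:00", tz = melb),
  by = "30 min"
)

# Period arithmetic lands in the gap for some elements and gives NA
bad <- which(is.na(idx - days(1)))
idx[bad]

# UTC has no DST gap, which is why converting the time zone avoided the error
anyNA(with_tz(idx, "UTC") - days(1))
```

The flagged timestamps are the 02:00 and 02:30 readings on 2012-10-08, whose counterparts one calendar day earlier fall inside the skipped hour.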

I think the easiest and most correct solution in this case is to just use ddays() rather than days(), which won't ever produce an NA result.

library(tsibbledata)
library(lubridate)

time <- vic_elec$Time

time[13493]
#> [1] "2012-10-08 02:00:00 AEDT"

# Goes back 24 hours. Like you waited around for exactly
# 86,400 seconds and then looked at the clock and this is what
# you saw
time[13493] - ddays(1)
#> [1] "2012-10-07 01:00:00 AEST"

library(tidyverse)
library(tsibbledata)
library(slider)
library(lubridate)

vic_elec %>%
  mutate(
    demand_1d_mean = slide_index_dbl(
      .x = Demand,
      .i = Time,
      .f = mean, 
      .before = ddays(1)
    )
  )
#> # A tibble: 52,608 × 6
#>    Time                Demand Temperature Date       Holiday demand_1d_mean
#>    <dttm>               <dbl>       <dbl> <date>     <lgl>            <dbl>
#>  1 2012-01-01 00:00:00  4383.        21.4 2012-01-01 TRUE             4383.
#>  2 2012-01-01 00:30:00  4263.        21.0 2012-01-01 TRUE             4323.
#>  3 2012-01-01 01:00:00  4049.        20.7 2012-01-01 TRUE             4232.
#>  4 2012-01-01 01:30:00  3878.        20.6 2012-01-01 TRUE             4143.
#>  5 2012-01-01 02:00:00  4036.        20.4 2012-01-01 TRUE             4122.
#>  6 2012-01-01 02:30:00  3866.        20.2 2012-01-01 TRUE             4079.
#>  7 2012-01-01 03:00:00  3694.        20.1 2012-01-01 TRUE             4024.
#>  8 2012-01-01 03:30:00  3562.        19.6 2012-01-01 TRUE             3966.
#>  9 2012-01-01 04:00:00  3433.        19.1 2012-01-01 TRUE             3907.
#> 10 2012-01-01 04:30:00  3359.        19.0 2012-01-01 TRUE             3852.
#> # … with 52,598 more rows