fill missing data in incomplete epiweek

stephenturner commented 2 years ago

If we run a forecast on a Monday when the previous epiweek is incomplete, we'll run into a problem: the previous week will be removed, so our forecasts won't truly reach horizon weeks ahead of the forecast date.

We can get around this issue by setting remove_incomplete=FALSE in prep, but then the last week's total will appear artificially low because it covers fewer than seven days, which could cause problems with a time series approach.
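
To make the undercount concrete, here is a small sketch using the CA values from the reprex further down in this issue. It assumes the weekly value is a simple sum of daily admissions; the variable names are illustrative, and the scaled-up figure is only a rough point of comparison, not a proposed method.

```r
# Daily CA flu admissions copied from the reprex in this issue
# (2022-03-13 through 2022-03-17, i.e. 5 of the 7 days in the epiweek)
ca <- data.frame(
  date = as.Date("2022-03-13") + 0:4,
  flu.admits = c(7L, 10L, 5L, 11L, 4L)
)

days_observed <- nrow(ca)
partial_total <- sum(ca$flu.admits)

# Rough comparison only: what a full week would total if the two missing
# days resembled the observed daily average (an assumption, not a forecast)
rough_full_week <- 7 * partial_total / days_observed

partial_total
rough_full_week
```

With only five days reported, the partial total is 37, while a week at the same daily average would come to about 52.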

Possible start to a solution implemented below.

We could determine what the last date of the incomplete epiweek is, then fill in the missing values separately for each state using the most recent nonmissing value (demonstrated here). Alternatively we could use a median/mean of the last n days (not demonstrated here, but not too difficult to implement).

One thing to consider: if this code runs on a Monday and we already have data for the Sunday that starts the current epiweek, it would perform this "imputation" through the Saturday of the current week, i.e. into the future.
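
One way to guard against filling into the future is sketched below in base R. Both helpers (`epiweek_end()` and `should_fill_epiweek()`) are hypothetical and not part of fiphde; the sketch relies only on MMWR epiweeks running Sunday through Saturday, and it fills only when the incomplete epiweek has already ended as of the run date.

```r
# Hypothetical helper (not part of fiphde): the Saturday that ends the
# MMWR epiweek containing `d`. Epiweeks run Sunday through Saturday.
epiweek_end <- function(d) {
  d <- as.Date(d)
  d + (6 - as.POSIXlt(d)$wday)  # wday: 0 = Sunday, ..., 6 = Saturday
}

# Hypothetical helper: TRUE only if the last epiweek in the data is
# incomplete AND that epiweek ended before `today`, so nothing is
# imputed for dates that have not happened yet.
should_fill_epiweek <- function(last_date, today = Sys.Date()) {
  last_date <- as.Date(last_date)
  week_end <- epiweek_end(last_date)
  last_date < week_end && week_end < as.Date(today)
}

# Data through Thursday 2022-03-17, run on Monday 2022-03-21:
# the epiweek ended Saturday 2022-03-19, so filling is reasonable.
should_fill_epiweek("2022-03-17", today = "2022-03-21")

# Data through Sunday 2022-03-20, run on Monday 2022-03-21:
# the epiweek ends Saturday 2022-03-26, so filling would reach into the future.
should_fill_epiweek("2022-03-20", today = "2022-03-21")
```

The first call returns TRUE and the second FALSE. In the second case it may be more sensible to drop the lone Sunday than to fill six days from it.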

Opening this with some breadcrumbs to come back to should we find this to be a problem again in coming weeks.

suppressPackageStartupMessages({
  library(tidyverse)
  library(fiphde)
})

hdgov_hosp %>% 
  filter(state=="CA" | state=="NC") %>% 
  filter(date>="2022-03-13" & date <="2022-03-17") %>% 
  select(state:flu.admits.cov) %>% 
  # copy the subset to the clipboard; allow_non_interactive is needed when rendering a reprex
  clipr::write_clip(allow_non_interactive = TRUE)

hdgov_hosp <- tibble::tribble(
                ~state,        ~date, ~flu.admits, ~flu.admits.cov,
                  "CA", "2022-03-13",          7L,            367L,
                  "NC", "2022-03-13",          4L,            116L,
                  "CA", "2022-03-14",         10L,            368L,
                  "NC", "2022-03-14",          1L,            115L,
                  "CA", "2022-03-15",          5L,            402L,
                  "NC", "2022-03-15",          0L,            125L,
                  "CA", "2022-03-16",         11L,            404L,
                  "NC", "2022-03-16",          3L,            125L,
                  "CA", "2022-03-17",          4L,            404L,
                  "NC", "2022-03-17",          2L,            124L
                )

hdgov_hosp
#> # A tibble: 10 × 4
#>    state date       flu.admits flu.admits.cov
#>    <chr> <chr>           <int>          <int>
#>  1 CA    2022-03-13          7            367
#>  2 NC    2022-03-13          4            116
#>  3 CA    2022-03-14         10            368
#>  4 NC    2022-03-14          1            115
#>  5 CA    2022-03-15          5            402
#>  6 NC    2022-03-15          0            125
#>  7 CA    2022-03-16         11            404
#>  8 NC    2022-03-16          3            125
#>  9 CA    2022-03-17          4            404
#> 10 NC    2022-03-17          2            124

last_date <- max(hdgov_hosp$date)
last_date
#> [1] "2022-03-17"

last_epi <- MMWRweek::MMWRweek(last_date)
last_epi
#>   MMWRyear MMWRweek MMWRday
#> 1     2022       11       5

last_saturday <- MMWRweek::MMWRweek2Date(last_epi$MMWRyear, last_epi$MMWRweek, 7)
last_saturday
#> [1] "2022-03-19"

# issue a warning
if (last_date!=last_saturday) {
  warning(sprintf("Last day of data (%s) isn't last date of that epiweek (%s)", last_date, last_saturday))
}
#> Warning: Last day of data (2022-03-17) isn't last date of that epiweek
#> (2022-03-19)

# do stuff, e.g.:  if (fill_epiweek) {...}
if (last_date!=last_saturday) {
  # placeholder: the fill steps shown below would go here
}
#> NULL

new_dates <- seq.Date(from=as.Date(last_date)+1, to=as.Date(last_saturday), by="days")
new_dates
#> [1] "2022-03-18" "2022-03-19"

# dnm = Data with New dates Missing
dnm <- crossing(state=unique(hdgov_hosp$state), date=as.character(new_dates)) %>% 
  full_join(hdgov_hosp, .) %>% 
  arrange(date, state)
#> Joining, by = c("state", "date")
dnm
#> # A tibble: 14 × 4
#>    state date       flu.admits flu.admits.cov
#>    <chr> <chr>           <int>          <int>
#>  1 CA    2022-03-13          7            367
#>  2 NC    2022-03-13          4            116
#>  3 CA    2022-03-14         10            368
#>  4 NC    2022-03-14          1            115
#>  5 CA    2022-03-15          5            402
#>  6 NC    2022-03-15          0            125
#>  7 CA    2022-03-16         11            404
#>  8 NC    2022-03-16          3            125
#>  9 CA    2022-03-17          4            404
#> 10 NC    2022-03-17          2            124
#> 11 CA    2022-03-18         NA             NA
#> 12 NC    2022-03-18         NA             NA
#> 13 CA    2022-03-19         NA             NA
#> 14 NC    2022-03-19         NA             NA

# fill with most recent value
filled_down <- 
  dnm %>% 
  group_by(state) %>% 
  tidyr::fill(starts_with("flu"), starts_with("cov"), .direction = "down")
filled_down
#> # A tibble: 14 × 4
#> # Groups:   state [2]
#>    state date       flu.admits flu.admits.cov
#>    <chr> <chr>           <int>          <int>
#>  1 CA    2022-03-13          7            367
#>  2 NC    2022-03-13          4            116
#>  3 CA    2022-03-14         10            368
#>  4 NC    2022-03-14          1            115
#>  5 CA    2022-03-15          5            402
#>  6 NC    2022-03-15          0            125
#>  7 CA    2022-03-16         11            404
#>  8 NC    2022-03-16          3            125
#>  9 CA    2022-03-17          4            404
#> 10 NC    2022-03-17          2            124
#> 11 CA    2022-03-18          4            404
#> 12 NC    2022-03-18          2            124
#> 13 CA    2022-03-19          4            404
#> 14 NC    2022-03-19          2            124

# fill with mean of that week
# ???
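
A minimal sketch of the mean-based fill left as a placeholder above. It assumes the fill value is the mean of the last `n_days` observed values within each state; `n_days = 3` is an illustrative choice, not something decided in this issue. Only `flu.admits` is carried to keep the example short.

```r
library(dplyr)
library(tidyr)

# Daily values copied from the reprex above (flu.admits only)
hosp <- tibble::tibble(
  state = rep(c("CA", "NC"), times = 5),
  date = rep(as.Date("2022-03-13") + 0:4, each = 2),
  flu.admits = c(7L, 4L, 10L, 1L, 5L, 0L, 11L, 3L, 4L, 2L)
)

# Remaining days of the incomplete epiweek
new_dates <- as.Date(c("2022-03-18", "2022-03-19"))

# Assumption: window size for the trailing mean
n_days <- 3

filled_mean <- hosp %>%
  # add one NA row per state for each missing date
  full_join(crossing(state = unique(hosp$state), date = new_dates),
            by = c("state", "date")) %>%
  arrange(state, date) %>%
  group_by(state) %>%
  mutate(
    # mean of the most recent n_days non-missing values in this state
    fill_value = mean(tail(flu.admits[!is.na(flu.admits)], n_days)),
    flu.admits = if_else(is.na(flu.admits), fill_value, as.numeric(flu.admits))
  ) %>%
  ungroup() %>%
  select(-fill_value)

filled_mean
```

Swapping `mean()` for `median()` gives the median variant. The filled values are generally non-integer (about 6.67 for CA here), so rounding may be wanted before weekly aggregation.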

stephenturner commented 2 years ago

Edited, fixed reprex

vpnagraj commented 2 years ago

any kind of imputation will introduce bias. after seeing the reporting issue resolved this week, i say we close this issue for now. if we need to revisit we definitely can.

signaturescience / fiphde

fill missing data in incomplete epiweek #113