signaturescience / fiphde

Forecasting Influenza in Support of Public Health Decision Making
https://signaturescience.github.io/fiphde/
GNU General Public License v3.0
3 stars 1 forks source link

Problem with forecast_categorical when a level isn't completely missing #171

Closed stephenturner closed 1 year ago

stephenturner commented 1 year ago

There's an issue with forecast_categorical(). It looks to see if a level is completely missing from the forecasts. E.g., large_increase would be unlikely to be seen for any location in recent data, so it would bind in large_increasewith value 0 for all locations. However, if at least one location has >0 probability for large_increase, it will not be perceived as missing_from_res in our current implementation, and would not be filled in for locations where large_increase was absent. A better approach is to cross everything, removing locations where you actually have observed >0 density, then bind that back into the original incomplete data.

Demonstration below. We have three locations, and we should expect 15 rows returned. In our current implementation we only get 12 rows returned.

suppressPackageStartupMessages(library(tidyverse))

res <- 
  tibble::tribble(
  ~forecast_date,                     ~target, ~location,      ~type,         ~type_id, ~value,
      "12/12/22", "2 wk flu hosp rate change",       10L, "category", "large_decrease",   0.25,
      "12/12/22", "2 wk flu hosp rate change",       10L, "category",       "decrease",   0.25,
      "12/12/22", "2 wk flu hosp rate change",       10L, "category",         "stable",   0.25,
      "12/12/22", "2 wk flu hosp rate change",       10L, "category",       "increase",   0.25,
      "12/12/22", "2 wk flu hosp rate change",       20L, "category", "large_decrease",    0.9,
      "12/12/22", "2 wk flu hosp rate change",       20L, "category",       "decrease",    0.1,
      "12/12/22", "2 wk flu hosp rate change",       30L, "category", "large_decrease",    0.4,
      "12/12/22", "2 wk flu hosp rate change",       30L, "category",       "decrease",    0.4,
      "12/12/22", "2 wk flu hosp rate change",       30L, "category",         "stable",    0.2
  )
res
#> # A tibble: 9 × 6
#>   forecast_date target                    location type     type_id        value
#>   <chr>         <chr>                        <int> <chr>    <chr>          <dbl>
#> 1 12/12/22      2 wk flu hosp rate change       10 category large_decrease  0.25
#> 2 12/12/22      2 wk flu hosp rate change       10 category decrease        0.25
#> 3 12/12/22      2 wk flu hosp rate change       10 category stable          0.25
#> 4 12/12/22      2 wk flu hosp rate change       10 category increase        0.25
#> 5 12/12/22      2 wk flu hosp rate change       20 category large_decrease  0.9 
#> 6 12/12/22      2 wk flu hosp rate change       20 category decrease        0.1 
#> 7 12/12/22      2 wk flu hosp rate change       30 category large_decrease  0.4 
#> 8 12/12/22      2 wk flu hosp rate change       30 category decrease        0.4 
#> 9 12/12/22      2 wk flu hosp rate change       30 category stable          0.2

# Current behavior
categories <- c("large_decrease", "decrease", "stable", "increase", "large_increase")
missing_from_res <- categories[!categories %in% res$type_id]
# Note that stable, increase, and large decrease is missing from location 20, but not from all the data
missing_from_res
#> [1] "large_increase"
add_to_res <- tidyr::crossing(forecast_date=unique(res$forecast_date),
                              target=unique(res$target),
                              location=unique(res$location),
                              type=unique(res$type),
                              type_id=missing_from_res,
                              value=0)
# So only "large_increase" will get added!
add_to_res
#> # A tibble: 3 × 6
#>   forecast_date target                    location type     type_id        value
#>   <chr>         <chr>                        <int> <chr>    <chr>          <dbl>
#> 1 12/12/22      2 wk flu hosp rate change       10 category large_increase     0
#> 2 12/12/22      2 wk flu hosp rate change       20 category large_increase     0
#> 3 12/12/22      2 wk flu hosp rate change       30 category large_increase     0
res <-
  res %>%
  dplyr::bind_rows(add_to_res) %>%
  dplyr::arrange(location) %>%
  dplyr::mutate(type_id=factor(type_id, levels=categories))
# This is still missing stable, large_increase from 20, and increase from 30
res
#> # A tibble: 12 × 6
#>    forecast_date target                    location type     type_id       value
#>    <chr>         <chr>                        <int> <chr>    <fct>         <dbl>
#>  1 12/12/22      2 wk flu hosp rate change       10 category large_decrea…  0.25
#>  2 12/12/22      2 wk flu hosp rate change       10 category decrease       0.25
#>  3 12/12/22      2 wk flu hosp rate change       10 category stable         0.25
#>  4 12/12/22      2 wk flu hosp rate change       10 category increase       0.25
#>  5 12/12/22      2 wk flu hosp rate change       10 category large_increa…  0   
#>  6 12/12/22      2 wk flu hosp rate change       20 category large_decrea…  0.9 
#>  7 12/12/22      2 wk flu hosp rate change       20 category decrease       0.1 
#>  8 12/12/22      2 wk flu hosp rate change       20 category large_increa…  0   
#>  9 12/12/22      2 wk flu hosp rate change       30 category large_decrea…  0.4 
#> 10 12/12/22      2 wk flu hosp rate change       30 category decrease       0.4 
#> 11 12/12/22      2 wk flu hosp rate change       30 category stable         0.2 
#> 12 12/12/22      2 wk flu hosp rate change       30 category large_increa…  0

# This is what we want to do. Cross everything, then remove the ones where you already have data, then bind back in.

# What are the names of the categories to forecast?
categories <- c("large_decrease", "decrease", "stable", "increase", "large_increase")
# Which ones are missing from the data?
# Fill those in with zeros
allcrossed <- tidyr::crossing(forecast_date=unique(res$forecast_date),
                              target=unique(res$target),
                              location=unique(res$location),
                              type=unique(res$type),
                              type_id=categories,
                              value=0)
res <- allcrossed %>% 
  anti_join(res, by=c("forecast_date", "target", "location", "type", "type_id")) %>% 
  bind_rows(res, .) %>% 
  arrange(location) %>% 
  dplyr::mutate(type_id=factor(type_id, levels=categories)) %>% 
  filter(!is.na(type_id))

res
#> # A tibble: 15 × 6
#>    forecast_date target                    location type     type_id       value
#>    <chr>         <chr>                        <int> <chr>    <fct>         <dbl>
#>  1 12/12/22      2 wk flu hosp rate change       10 category large_decrea…  0.25
#>  2 12/12/22      2 wk flu hosp rate change       10 category decrease       0.25
#>  3 12/12/22      2 wk flu hosp rate change       10 category stable         0.25
#>  4 12/12/22      2 wk flu hosp rate change       10 category increase       0.25
#>  5 12/12/22      2 wk flu hosp rate change       10 category large_increa…  0   
#>  6 12/12/22      2 wk flu hosp rate change       20 category large_decrea…  0.9 
#>  7 12/12/22      2 wk flu hosp rate change       20 category decrease       0.1 
#>  8 12/12/22      2 wk flu hosp rate change       20 category large_increa…  0   
#>  9 12/12/22      2 wk flu hosp rate change       20 category increase       0   
#> 10 12/12/22      2 wk flu hosp rate change       20 category stable         0   
#> 11 12/12/22      2 wk flu hosp rate change       30 category large_decrea…  0.4 
#> 12 12/12/22      2 wk flu hosp rate change       30 category decrease       0.4 
#> 13 12/12/22      2 wk flu hosp rate change       30 category stable         0.2 
#> 14 12/12/22      2 wk flu hosp rate change       30 category large_increa…  0   
#> 15 12/12/22      2 wk flu hosp rate change       30 category increase       0

Created on 2023-03-14 with reprex v2.0.2

vpnagraj commented 1 year ago

done in #173