signaturescience / focustools

Forecasting COVID-19 in the US
https://signaturescience.github.io/focustools/
GNU General Public License v3.0

data prep to translate target forecasts to submission file format #6

Closed vpnagraj closed 3 years ago

vpnagraj commented 3 years ago

the COVID-19 forecast hub has strict requirements for the forecast submission format:

https://github.com/reichlab/covid19-forecast-hub/blob/master/data-processed/README.md#forecast-file-format

once we generate point / quantile forecasts for targets we need to execute some post-processing to wrangle the data into the required format

this could include:

once we have the submission file format prepped, we can validate locally:

vpnagraj commented 3 years ago

NOTE the data prep of forecast output will depend on the forecast method used ... but given that we are likely starting with a time series model (and using the fable framework) then the prep should work for whatever method we land on for initial implementation (ARIMA, ETS, etc)
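For orientation, here is a minimal sketch in R of the long layout that the prep step has to produce, assuming the seven columns described in the hub README (forecast_date, target, target_end_date, location, type, quantile, value). The helper name `as_submission_rows()` and the toy numbers are illustrative only and are not part of focustools.

```r
# Hypothetical helper: stack one point forecast and its quantiles into
# the long, seven-column layout used by the forecast hub.
as_submission_rows <- function(forecast_date, target, target_end_date,
                               location, point, probs, quantile_values) {
  stopifnot(length(probs) == length(quantile_values))
  n <- length(probs)
  data.frame(
    forecast_date   = as.Date(forecast_date),
    target          = target,
    target_end_date = as.Date(target_end_date),
    location        = location,
    type            = c("point", rep("quantile", n)),
    quantile        = c(NA_real_, probs),   # point rows carry no quantile level
    value           = c(point, quantile_values),
    stringsAsFactors = FALSE
  )
}

# Toy values for illustration only.
sub <- as_submission_rows("2021-01-04", "1 wk ahead inc death", "2021-01-09",
                          "US", point = 100,
                          probs = c(0.25, 0.5, 0.75),
                          quantile_values = c(90, 100, 110))
print(sub)
```

Whatever model produces the numbers (ARIMA, ETS, etc), only the inputs to a helper like this would change; the output layout stays fixed.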

stephenturner commented 3 years ago

Initial shot at this in 6bbd2fc. Lots of outstanding issues here.

forecast_date  target               target_end_date  location  type      quantile  value
2020-12-22     1 wk ahead inc case  2020-12-12       US        point     NA        1595791
2020-12-22     2 wk ahead inc case  2020-12-19       US        point     NA        1717279
2020-12-22     3 wk ahead inc case  2020-12-26       US        point     NA        1874981
2020-12-22     4 wk ahead inc case  2021-01-02       US        point     NA        1992527
2020-12-22     1 wk ahead inc case  2020-12-12       US        quantile  0.025     1518044
2020-12-22     2 wk ahead inc case  2020-12-19       US        quantile  0.025     1477875
2020-12-22     3 wk ahead inc case  2020-12-26       US        quantile  0.025     1502830
2020-12-22     4 wk ahead inc case  2021-01-02       US        quantile  0.025     1521864
2020-12-22     1 wk ahead inc case  2020-12-12       US        quantile  0.100     1539116
2020-12-22     2 wk ahead inc case  2020-12-19       US        quantile  0.100     1615797
2020-12-22     3 wk ahead inc case  2020-12-26       US        quantile  0.100     1725278
2020-12-22     4 wk ahead inc case  2021-01-02       US        quantile  0.100     1784281
2020-12-22     1 wk ahead inc case  2020-12-12       US        quantile  0.250     1578533
2020-12-22     2 wk ahead inc case  2020-12-19       US        quantile  0.250     1670508
2020-12-22     3 wk ahead inc case  2020-12-26       US        quantile  0.250     1805991
2020-12-22     4 wk ahead inc case  2021-01-02       US        quantile  0.250     1894356
2020-12-22     1 wk ahead inc case  2020-12-12       US        quantile  0.500     1590791
2020-12-22     2 wk ahead inc case  2020-12-19       US        quantile  0.500     1712935
2020-12-22     3 wk ahead inc case  2020-12-26       US        quantile  0.500     1876995
2020-12-22     4 wk ahead inc case  2021-01-02       US        quantile  0.500     1995373
2020-12-22     1 wk ahead inc case  2020-12-12       US        quantile  0.750     1614622
2020-12-22     2 wk ahead inc case  2020-12-19       US        quantile  0.750     1777245
2020-12-22     3 wk ahead inc case  2020-12-26       US        quantile  0.750     1965876
2020-12-22     4 wk ahead inc case  2021-01-02       US        quantile  0.750     2119008
2020-12-22     1 wk ahead inc case  2020-12-12       US        quantile  0.900     1656737
2020-12-22     2 wk ahead inc case  2020-12-19       US        quantile  0.900     1843507
2020-12-22     3 wk ahead inc case  2020-12-26       US        quantile  0.900     2060315
2020-12-22     4 wk ahead inc case  2021-01-02       US        quantile  0.900     2236633
2020-12-22     1 wk ahead inc case  2020-12-12       US        quantile  0.975     1717527
2020-12-22     2 wk ahead inc case  2020-12-19       US        quantile  0.975     1911562
2020-12-22     3 wk ahead inc case  2020-12-26       US        quantile  0.975     2134267
2020-12-22     4 wk ahead inc case  2021-01-02       US        quantile  0.975     2374745

stephenturner commented 3 years ago

I have this working in some code at f2c3e91

  1. Fit separate models for each outcome (inc cases, inc deaths, cum deaths). (I tried fitting multiple models in the same fit object with different dependent variables, but fable complains: you can't have a mable (model table) with different Y vars.) So, for now, different model objects.
  2. Pass them to the format_fit_for_submission() function. This produces the forecast at the desired horizon, bootstraps each model fit 1000 times, gets the quibble (quantile tibble) for each fit using 23 quantiles, then restricts down to the smaller subset of quantiles if you're looking at inc cases.
  3. Bind rows from each of these function calls together from each metric you're looking at to create the final submission.
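The quantile step in the list above can be sketched roughly as follows. This is not the code in f2c3e91: the bootstrap samples are simulated with `rnorm()` as a stand-in, and the object names (`death_probs`, `case_probs`, `quibble`) are illustrative. The 23 and 7 quantile levels are the ones the hub asks for.

```r
# The 23 quantile levels required for death targets and the 7-level subset
# required for incident cases, built from integer thousandths so that the
# levels compare exactly with %in%.
death_probs <- c(10, 25, seq(50, 950, by = 50), 975, 990) / 1000
case_probs  <- c(25, 100, 250, 500, 750, 900, 975) / 1000

# Stand-in for bootstrapped forecast samples at one horizon (simulated here;
# in practice these would come from the fitted model).
set.seed(1)
sims <- rnorm(1000, mean = 1600000, sd = 50000)

quibble <- data.frame(
  quantile = death_probs,
  value    = round(unname(quantile(sims, probs = death_probs)))
)

# Keep only the levels allowed for inc case targets.
case_quibble <- quibble[quibble$quantile %in% case_probs, ]
print(case_quibble)
```

Building the levels from integers avoids the floating-point mismatches that `seq(0.05, 0.95, by = 0.05)` can introduce when filtering with `%in%`.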

fable-submission-mockup-allmetrics.csv.txt (github doesn't let you upload .csv extensions, remove the .txt)

Notes / known issues:

@vpnagraj run through this code a pipe at a time, see if you have any suggestions.

vpnagraj commented 3 years ago

stepped through what you have (using the *-allmetrics version of the script)

pushed up some edits:

https://github.com/signaturescience/focustools/commit/344d7122ce6435ff81fcfd4d48e2664c20c2d681

i think i have a candidate fix for the text formatting conversion of "icases" to "inc cases" ... just pass in a new argument for target_name ? seems to be working

also played around with the dates a little bit. agreed that something is way off. i reworked your code, thought i had fixed the issue (to get the epiweek date starting on sunday instead of monday) but now that i'm looking at this issue again it looks the same as your comment above (https://github.com/signaturescience/focustools/issues/6#issuecomment-749668242)

🤔

im wondering if get_cases() and our exclusion of last week (because it's incomplete) is throwing a wrench here ...

vpnagraj commented 3 years ago

@stephenturner FYI looks like get_cases() and get_deaths() did include logic to remove the current week. that same (or similar) logic was implemented in the TS modeling code:

https://github.com/signaturescience/focustools/blob/master/scratch/fable-submission-mockup-allmetrics.R#L29

i think it's better to handle it that way ^ ... ie let's drop the current week exclusion from get_cases() and get_deaths()

done in https://github.com/signaturescience/focustools/commit/4aa7bdb20ec5b3dd392d2ad9f721818ae539f680

so that saved us one week of data. we're still bumping into the issue with horizon being k + 1 week (current week that we can't/shouldn't use in modeling because it is incomplete)

need to keep thinking on this ...

stephenturner commented 3 years ago

I'm still cracking at this. I think the problem comes in with mmwrweeks being converted to dates, then to yearweeks, then back to dates.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Saturday
MMWRweek::MMWRweek("2020-12-26")$MMWRweek
#> [1] 52
# Sunday
MMWRweek::MMWRweek("2020-12-27")$MMWRweek
#> [1] 53
# Monday
MMWRweek::MMWRweek("2020-12-28")$MMWRweek
#> [1] 53

# Sunday
MMWRweek::MMWRweek2Date(2020, 53, MMWRday = 1)
#> [1] "2020-12-27"
MMWRweek::MMWRweek2Date(2020, 53, MMWRday = 1) %>% tsibble::yearweek()
#> <yearweek[1]>
#> [1] "2020 W52"
#> # Week starts on: Monday
MMWRweek::MMWRweek2Date(2020, 53, MMWRday = 1) %>% tsibble::yearweek() %>% lubridate::as_date()
#> [1] "2020-12-21"

Still trying to craft that reprex.
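The mismatch above can also be reproduced with base R only, which may make the cause easier to see: MMWR epiweeks start on Sunday, while ISO 8601 weeks (which the yearweek output above reports, "Week starts on: Monday") start on Monday. This is a sketch for illustration; the variable names are made up.

```r
d <- as.Date("2020-12-27")            # a Sunday, first day of MMWR week 53
wday <- as.POSIXlt(d)$wday            # 0 = Sunday, ..., 6 = Saturday

# ISO 8601 weeks start on Monday, so this Sunday still belongs to ISO week 52.
iso_week <- format(d, "%V")

# Monday that starts the ISO week containing d.
iso_monday <- d - (wday + 6) %% 7

# Sunday that starts the epiweek containing d.
epi_sunday <- d - wday

print(c(iso_week   = iso_week,
        iso_monday = format(iso_monday),
        epi_sunday = format(epi_sunday)))
```

The Sunday that opens an epiweek is the last day of the previous ISO week, so a round trip through an ISO-based week index lands on the Monday six days earlier.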

stephenturner commented 3 years ago

I pushed some code in a new script at 6ff70b4. Run through that. I think the ~best~ oversimplified approach here might be simply adding a +6 or -7 or whatever somewhere.

stephenturner commented 3 years ago

I think the date issue is fixed now. I'm creating the tsibble with a function that adds a monday column, which is the Monday of that epiweek, and bases the yearweek tsibble index column on that Monday. Later, after modeling/forecasting, I take the as_date() of that yweek, which returns the Monday of that (1, 2, 3, or 4) week ahead forecast, and add days(5) to get the Saturday that ends that epiweek.
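The Monday-plus-five-days step can be sketched in base R like this. The helper name `epiweek_saturday()` is hypothetical, and the real code works on the yearweek index rather than on plain dates.

```r
# Hypothetical helper: given the Monday used as the weekly index, return the
# Saturday that closes the same epiweek (Sunday through Saturday).
epiweek_saturday <- function(monday) {
  monday <- as.Date(monday)
  stopifnot(all(as.POSIXlt(monday)$wday == 1))  # 1 = Monday
  monday + 5
}

mondays <- as.Date("2021-01-04") + 7 * 0:3      # Mondays of four consecutive weeks
print(epiweek_saturday(mondays))
```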

From the https://github.com/reichlab/covid19-forecast-hub#ensemble-model section:

For inclusion in the ensemble, we additionally require that forecasts include a full set of 23 quantiles to be submitted for each of the one through four week ahead values for forecasts of deaths, and a full set of 7 quantiles for the one through four week ahead values for forecasts of cases (see technical README for details), and that the 10th quantile of the predictive distribution for a 1 week ahead forecast of cumulative deaths is not below the most recently observed data.

I don't think the current forecasts based on the auto ARIMA models are doing this, but we should probably add a check/correction for this case, that if the 10th quantile of any cumulative forecast is below the most recently observed data, then make it equal to the most recent observed data, at a minimum.
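A correction along those lines could look roughly like the sketch below. The data frame, its values, and `last_observed` are made up for illustration; this version floors every cumulative value rather than only the 10th quantile, which is one possible reading of "at a minimum".

```r
# Toy cumulative-death forecast rows (values made up for illustration).
fc <- data.frame(
  target   = "1 wk ahead cum death",
  type     = c("point", "quantile", "quantile", "quantile"),
  quantile = c(NA, 0.1, 0.5, 0.9),
  value    = c(352000, 349500, 352000, 355000)
)
last_observed <- 350000   # most recent observed cumulative deaths (assumed)

# Raise any cumulative value that falls below the latest observation.
is_cum <- grepl("cum death", fc$target, fixed = TRUE)
fc$value[is_cum] <- pmax(fc$value[is_cum], last_observed)
print(fc)
```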

stephenturner commented 3 years ago

This check that forecasts for cumulative deaths are not below current week values is now implemented in ae43487. But I haven't yet figured out the best place for this to reside, functionally. The format_fit_for_submission() function takes as input the model table (output from model()), and doesn't actually take any data as input. The current week's cumulative death value actually resides in the data. If we wrote one monster function that did all of the modeling, forecasting, and formatting, we could do this there, because that function would have to take the data as input, not the models. But I kind of like keeping them separate for now, because it makes tinkering around with the modeling a bit easier, doing it outside of some monster function call. Anyway, for now, the bolt-on fix in ae43487 works, and we can sort out how to best modularize/functionalize this later. @vpnagraj if you wouldn't mind, run through the script https://github.com/signaturescience/focustools/blob/master/scratch/fable-submission-mockup-allmetrics.R to see if this all looks legit to you.

stephenturner commented 3 years ago

I added some code in cfdd1e8e1813dc558215d43e0a764c0340e8a130 to use the script added in 63a2ff968a69e61f8c6b7c7d50f630c0d6d2bb8e to validate the submission.

> forecast_filename <- here::here("scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv")
> validate_file(forecast_filename)

 Validating /Users/sturner/sigsci/irad/focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv ...
VALIDATED: filename 
VALIDATED: column names
VALIDATED: no NA values
Warning in verify_targets(entry) :
  ERROR: Some entries in `targets` do not correspond to standards:1 wk ahead cum deaths, 1 wk ahead inc cases, 1 wk ahead inc deaths, 2 wk ahead cum deaths, 2 wk ahead inc cases, 2 wk ahead inc deaths, 3 wk ahead cum deaths, 3 wk ahead inc cases, 3 wk ahead inc deaths, 4 wk ahead cum deaths, 4 wk ahead inc cases, 4 wk ahead inc deaths
VALIDATED: date format 
VALIDATED: forecast_date, target_end_date
VALIDATED: no quantile crossing
VALIDATED: temporal monotonicity
VALIDATED: cum geq inc
VALIDATED: entries of `quantile`

So, everything seems to look okay except for the targets.

The code at https://github.com/reichlab/covid19-forecast-hub/blob/68df08d9e6e19d55fddab4bd5abb505202023ecb/code/validation/R-scripts/functions_plausibility.R#L169-L186 checks for 1, 2, 3, 4 wk ahead inc death and cum death, but doesn't allow for inc case:

#' Checking that all entries in `target` correspond to standards
#'
#' @param entry the data.frame
#'
#' @return invisibly returns TRUE if problems detected, FALSE otherwise
verify_targets <- function(entry){
  allowed_targets <- c(
    paste(0:130, "day ahead inc death"),
    paste(0:130, "day ahead cum death"),
    paste(0:20, "wk ahead inc death"),
    paste(0:20, "wk ahead cum death"),
    paste(0:130, "day ahead inc hosp")
  )
  targets_in_entry <- unique(entry$target)
  if(!all(targets_in_entry %in% allowed_targets)){
    warning("ERROR: Some entries in `targets` do not correspond to standards:",
            paste0(targets_in_entry[!(targets_in_entry %in% allowed_targets)], collapse = ", "))
    return(invisible(FALSE))
  }else{
    cat("VALIDATED: targets\n")
    return(invisible(TRUE))
  }
}

This doesn't jibe with what I thought was required here to be included in the ensemble forecast (https://github.com/reichlab/covid19-forecast-hub/tree/master/data-processed#target). Perhaps this R code is no longer maintained. According to the documentation at https://github.com/reichlab/covid19-forecast-hub/blob/master/data-processed/R_forecast_file_validation.md,

For those familiar with R (but not python), there is a separate set of tests that may be useful to diagnose data formatting issues in functions_plausibility.R. We have tried to keep these in sync with the python checks automatically run during a pull request, but have now stopped maintaining the checks in R. They are kept in the repository merely as an additional resource for teams who work exclusively with R. If you discover major discrepancies, you can nonetheless let us know and we may address them as time permits.

... in fact, after digging around a little bit, it seems like this is the case!

That R script, https://github.com/reichlab/covid19-forecast-hub/blob/master/code/validation/R-scripts/functions_plausibility.R, was last updated in May. According to the README, https://github.com/reichlab/covid19-forecast-hub/tree/master/data-processed#removed-targets, N day ahead inc cases was removed in June.

stephenturner commented 3 years ago

To the script in our utils/ folder, I added N wk ahead inc case to the allowed targets in af56e43.
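A sketch of what that kind of change amounts to, together with the singular spellings that verify_targets() above expects. This is not the code in af56e43: `normalize_target()` is a hypothetical helper, and the 0:20 range for the case target simply mirrors the death targets in the legacy check and is an assumption.

```r
# Hypothetical normalizer: the hub spells targets in the singular.
normalize_target <- function(target) {
  target <- sub("deaths$", "death", target)
  sub("cases$", "case", target)
}

# Allowed weekly targets, with "wk ahead inc case" added alongside the
# death targets the legacy R check already knew about.
allowed_targets <- c(
  paste(0:20, "wk ahead inc death"),
  paste(0:20, "wk ahead cum death"),
  paste(0:20, "wk ahead inc case")
)

targets <- c("1 wk ahead inc cases", "2 wk ahead cum deaths", "3 wk ahead inc death")
fixed <- normalize_target(targets)
print(fixed)
print(all(fixed %in% allowed_targets))
```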

This let the results pass that validation check (after changing 'deaths' to 'death' and 'cases' to 'case' in 5de29cdd4f9e02240e875a35472f20a0b9239105). But another validation effort failed:

> validate_file(forecast_filename)

 Validating /Users/sturner/sigsci/irad/focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv ...
VALIDATED: filename 
VALIDATED: column names
VALIDATED: no NA values
VALIDATED: targets
VALIDATED: date format 
VALIDATED: forecast_date, target_end_date
 Error in if (any(is_crossing)) { : missing value where TRUE/FALSE needed 

I dug into the validation scripts and there's a place right around here https://github.com/reichlab/covid19-forecast-hub/blob/68df08d9e6e19d55fddab4bd5abb505202023ecb/code/validation/R-scripts/functions_plausibility.R#L259-L282 where it checks for "quantile crossing". I'm not exactly sure what this is doing yet, but I think what's causing a problem here is that some targets have different quantiles required than others. inc deaths and cum deaths require a larger set of quantiles, while N wk ahead inc case (the newly added target) requires only a subset of those quantiles. This is spelled out in the data submission readme here.

I think this causes a problem with this old legacy code because one of the operations it performs is a widening reshape, and when there are some targets with a subset of quantiles compared to other targets, you end up with NAs in the wide matrix. I still don't fully understand what this check is looking for, but I silenced this validation problem in a2cd111 by omitting NAs from this crossing check. All the others were FALSE. This obviates the Error seen above, and all validation checks pass.

> validate_file(forecast_filename)

 Validating /Users/sturner/sigsci/irad/focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv ...
VALIDATED: filename 
VALIDATED: column names
VALIDATED: no NA values
VALIDATED: targets
VALIDATED: date format 
VALIDATED: forecast_date, target_end_date
VALIDATED: no quantile crossing
VALIDATED: temporal monotonicity
VALIDATED: cum geq inc
VALIDATED: entries of `quantile`
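As far as I can tell, "quantile crossing" means that a higher quantile level has a lower value than a lower level for the same target, which should never happen in a valid predictive distribution. The sketch below is not the hub's implementation; it checks each target separately in long form, which sidesteps the NA cells that a wide reshape produces when targets carry different quantile sets. The function name and toy data are illustrative, and a real check would also group by location.

```r
# Hypothetical check: TRUE when values decrease as the quantile level rises
# for at least one target.
has_quantile_crossing <- function(entry) {
  q <- entry[entry$type == "quantile", ]
  by_target <- split(q, q$target)
  any(vapply(by_target, function(x) {
    is.unsorted(x$value[order(x$quantile)])
  }, logical(1)))
}

# Two targets with different quantile sets, as with inc case vs inc death.
ok <- data.frame(
  target   = rep(c("1 wk ahead inc case", "1 wk ahead inc death"), c(3, 5)),
  type     = "quantile",
  quantile = c(0.025, 0.5, 0.975, 0.01, 0.025, 0.5, 0.975, 0.99),
  value    = c(90, 100, 110, 5, 6, 8, 10, 11)
)
bad <- ok
bad$value[2] <- 80   # median below the 0.025 quantile: a crossing

print(has_quantile_crossing(ok))
print(has_quantile_crossing(bad))
```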

CAVEAT: This works, but given the hacks I had to put into place to get this working, I'd recommend we either:

  1. Switch to the officially supported instructions for validating locally, https://github.com/reichlab/covid19-forecast-hub/wiki/Running-Checks-Locally
  2. Or else look around to see if someone else has forked and kept this R code up to date.

If we can find #-2 above, it sure would be more lightweight than going the #-1 route, which requires updating the upstream of the fork, installing some python pkgs, etc. Perhaps it isn't as burdensome as I think. I'll give it a spin on darwin if I can before our meeting today.

stephenturner commented 3 years ago

Follow up -- #-1 is pretty trivial. I set up a new conda environment, and followed the instructions at https://github.com/reichlab/covid19-forecast-hub/wiki/Running-Checks-Locally to install requirements and validate a single forecast file.

On darwin:

(focus) sturner@darwin:/data/projects/focus/covid19-forecast-hub$ python3 code/validation/validate_single_forecast_file.py ../focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv

VALIDATING ../focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv
✓ ../focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv is valid with no errors

🎉 🥳 🌟 ✔️

vpnagraj commented 3 years ago

@stephenturner parallel thought here ...

what if we put the python validation script / pkgs in a docker image ... and wrapped a call to that docker image in an R function (i.e. using something like stevedore) ?

i can help with that if you want to pursue. shouldn't be too big of a lift. BUT we'd obviously still need to make sure that validation code stays current
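One way such a wrapper might look, sketched with base R's system2() instead of stevedore so that no package API has to be assumed. The image name `focus/hub-validation` is made up, and the sketch assumes the image's working directory is a clone of the hub repository so that the script path used on the command line above resolves.

```r
# Hypothetical wrapper: build the arguments for `docker run` that would
# validate one forecast file inside a container. The image name is made up.
validate_args <- function(path, image = "focus/hub-validation") {
  path <- normalizePath(path, mustWork = TRUE)
  c("run", "--rm",
    "-v", paste0(dirname(path), ":/forecasts:ro"),   # mount the folder read-only
    image,
    "python3", "code/validation/validate_single_forecast_file.py",
    file.path("/forecasts", basename(path)))
}

validate_forecast_docker <- function(path, ...) {
  # Returns the exit status of the docker command (0 means success).
  system2("docker", shQuote(validate_args(path, ...)))
}

# Show the arguments for a placeholder file without actually calling docker.
tmp <- file.path(tempdir(), "2021-01-04-sigsci-arima.csv")
file.create(tmp)
args <- validate_args(tmp)
print(args)
```

Keeping the argument builder separate from the system2() call makes the wrapper easy to inspect without docker installed.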

stephenturner commented 3 years ago

I'd almost always prefer to call an R function than issue a python command/script at the bash shell. Looks like the requirements are pretty minimal.

https://github.com/reichlab/covid19-forecast-hub/blob/master/visualization/requirements.txt

vpnagraj commented 3 years ago

agreed. see https://github.com/signaturescience/focustools/issues/9

vpnagraj commented 3 years ago

@stephenturner heads up i've heavily refactored the scratch submission mockup code:

https://github.com/signaturescience/focustools/blob/master/scratch/fable-submission-mockup-allmetrics.R

things to note:

VALIDATING ../focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-05-sigsci-ts.csv
✘ Error in ../focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-05-sigsci-ts.csv. Error(s):
 ["target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'point', 'NA', '369373']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.01', '367031']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.025', '367118']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.05', '367358']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.1', '367969']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.15', '368250']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.2', '368334']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.25', '368688']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. 
exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.3', '369017']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.35', '369038']", 'target_end_date was ...']

any thoughts on that ^ ?

stephenturner commented 3 years ago

I don't know, unless it expected the target end date for 1 week ahead to end on the following saturday if you're dating the forecast after monday? I feel like I've seen something to this effect in the docs. Let me dig.

vpnagraj commented 3 years ago

sheesh.

well maybe thats OK? i mean im working on writing the validation wrapper for the python method now. we can stick to validating only before we are ready to submit on the sunday or monday. so as long as we generate the forecasts/validate on sunday or monday (before deadline) it should be fine? i think?
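The rule the validator appears to apply, judging from the error output above, can be sketched as follows: a forecast dated Sunday or Monday targets the Saturday of the same epiweek for "1 wk ahead", while a forecast dated Tuesday or later rolls over to the following Saturday. The function name is hypothetical and the rule is inferred from the messages rather than taken from the hub code.

```r
# Hypothetical helper mirroring the rule inferred from the validator output.
expected_target_end_date <- function(forecast_date, horizon = 1) {
  forecast_date <- as.Date(forecast_date)
  wday <- as.POSIXlt(forecast_date)$wday          # 0 = Sunday, ..., 6 = Saturday
  this_saturday <- forecast_date + (6 - wday)
  # Sunday/Monday forecasts count the current epiweek as "1 wk ahead".
  shift <- ifelse(wday <= 1, horizon - 1, horizon)
  this_saturday + 7 * shift
}

print(expected_target_end_date("2021-01-04"))   # forecast dated Monday
print(expected_target_end_date("2021-01-05"))   # forecast dated Tuesday
```

Under this rule, generating and validating on Sunday or Monday keeps the 1 wk ahead target on the upcoming Saturday, which matches the plan in the comment above.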

stephenturner commented 3 years ago

See #26. Reopening because there's currently a line hard-coding "US" as the location:

https://github.com/signaturescience/focustools/blob/e600847353be71b53886689b3c3af147bc247d97/R/submission.R#L72

dplyr::mutate(location="US", forecast_date=lubridate::today())

This will not allow for state or county-level granularity.

vpnagraj commented 3 years ago

@stephenturner see https://github.com/signaturescience/focustools/blob/state-level-ts/R/submission.R#L72

i removed the location="US" that was hardcoded in there. the forecast object should include a location column generated with get_cases() / get_deaths():

we do need to convert the state/territory name to appropriate FIPS:

https://github.com/signaturescience/focustools/blob/state-level-ts/R/submission.R#L72

i think that will be a simple join to focustools:::locations somewhere in format_for_submission() ?
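A minimal sketch of that join in base R. The toy `locations` table stands in for focustools:::locations, and its column names (`location` for the code, `location_name` for the name) are assumptions, as are the forecast values.

```r
# Toy stand-in for the package's location lookup; column names are assumed.
locations <- data.frame(
  location      = c("US", "01", "02"),
  location_name = c("US", "Alabama", "Alaska"),
  stringsAsFactors = FALSE
)

forecast <- data.frame(
  location_name = c("Alabama", "Alaska", "Alabama"),
  target        = c("1 wk ahead inc death", "1 wk ahead inc death",
                    "2 wk ahead inc death"),
  value         = c(250, 10, 260),
  stringsAsFactors = FALSE
)

# Look up the code for each name; match() keeps the forecast's row order.
forecast$location <- locations$location[match(forecast$location_name,
                                              locations$location_name)]
stopifnot(!anyNA(forecast$location))   # fail loudly on unmatched names
print(forecast)
```

Keeping the codes as character preserves leading zeros such as "01", which a numeric column would drop.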

vpnagraj commented 3 years ago

sorry to steamroll you here @stephenturner but i'm cooking on this state level stuff!

i just pushed up an edit to format_for_submission() that addresses the location join

that piece seems to be working now. mostly.

i'm seeing the following issues in validate_forecast() (full output at bottom of this comment):

[1] "entries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '1 wk ahead inc death', '2021-01-23', '02', 'quantile', '0.025', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '1 wk ahead inc death', '2021-01-23', '66', 'quantile', '0.05', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '1 wk ahead inc death', '2021-01-23', '78', 'quantile', '0.01', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '1 wk ahead inc death', '2021-01-23', '78', 'quantile', '0.025', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '1 wk ahead inc death', '2021-01-23', '78', 'quantile', '0.05', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '2 wk ahead inc death', '2021-01-30', '02', 'quantile', '0.025', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '2 wk ahead inc death', '2021-01-30', '66', 'quantile', '0.05', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '2 wk ahead inc death', '2021-01-30', '78', 'quantile', '0.01', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '2 wk ahead inc death', '2021-01-30', '78', 'quantile', '0.025', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '3 wk ahead inc death', '2021-02-06', '02', 'quantile', '0.025', '-1']\nentries in the `valu...\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'point', 'NA', '1426']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.01', '877']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. 
row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.025', '881']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.05', '885']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.1', '887']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.15', '888']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.2', '889']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.25', '890']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.3', '890']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.35', '892']\ninvalid location for...\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'point', 'NA', '6533']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.01', '6249']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. 
exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.025', '6249']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.05', '6391']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.1', '6401']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.15', '6408']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.2', '6432']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.25', '6445']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.3', '6446']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.35', '6446']\ntarget_end_date was ..."
stephenturner commented 3 years ago

unless we want to stick a condition in format_for_submission() that bounds all values at min 0? i kind of think that should go elsewhere ...

Bound it at zero for now. We could get more sophisticated... for cum deaths we would bound point and all quantiles at no less than the last week's current data. Inc death/cases: seems reasonable that the +1wk ahead should be no less than 2x the difference between 0 and -1wk. Or +2wk ahead should be no less than 2x the difference between 0 and -2wk. And still bounded at zero. I.e., enforcing that you can't drop incident cases/deaths more than twice as much as they changed in a previous horizon backward?

Where to do it? Agree doesn't really belong in a formatting script. But the ts_forecast doesn't yet track the data (#17), so you'd have to supply that as an arg there. Perhaps some final thing after formatting for submission, something like bound_submission(submission, data)? Although that could get tricky with submissions with multiple location granularities (eg from a bind_rows on a US level forecast with a state-level forecast) from different data objects with different location granularity?
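A rough sketch of what a bound_submission(submission, data) step might do, covering only the zero floor and the cumulative floor (not the 2x rule). The function does not exist in focustools at this point; the `cum_deaths` column and the one-row-per-location shape of `data` are assumptions. Looking up the floor by location code is one way to cope with a submission that mixes US and state rows.

```r
# Hypothetical bound_submission(): floor every value at zero, then floor
# cumulative targets at the latest observed cumulative count for the same
# location. `data` is assumed to hold one row per location.
bound_submission <- function(submission, data) {
  submission$value <- pmax(submission$value, 0)
  is_cum <- grepl("cum death", submission$target, fixed = TRUE)
  floor_value <- data$cum_deaths[match(submission$location, data$location)]
  use <- is_cum & !is.na(floor_value)
  submission$value[use] <- pmax(submission$value[use], floor_value[use])
  submission
}

# Toy rows for illustration only.
submission <- data.frame(
  location = c("02", "02", "US"),
  target   = c("1 wk ahead inc death", "1 wk ahead cum death",
               "1 wk ahead cum death"),
  value    = c(-1, 240, 369373),
  stringsAsFactors = FALSE
)
data <- data.frame(location = c("02", "US"), cum_deaths = c(250, 360000),
                   stringsAsFactors = FALSE)

print(bound_submission(submission, data))
```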

stephenturner commented 3 years ago

  • "invalid location for target. location='11001'": that's the location code for DC. we need to figure out what the correct code should be

"District of Columbia" is 11 right

https://github.com/signaturescience/focustools/blob/230e2bc88bc7c4d969d60db609a4a528b0a61f4a/data-raw/locations.csv#L11

vpnagraj commented 3 years ago

ahh DC is both:

https://github.com/signaturescience/focustools/blob/230e2bc88bc7c4d969d60db609a4a528b0a61f4a/data-raw/locations.csv#L379

11001 must be the county FIPS

need to make a special case to handle that somehow
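One possible way to handle it, sketched under assumptions: for state-level forecasts, restrict the lookup to "US" and two-character codes before joining on the name, so that the county-level entry for DC never takes part in the join. The toy table and its column names are illustrative and this is not necessarily the fix that ends up in the package.

```r
# Toy lookup reproducing the duplicate: DC appears as a state-level code
# ("11") and as a county-level code ("11001"). Column names are assumed.
locations <- data.frame(
  location      = c("11", "11001", "24", "24031"),
  location_name = c("District of Columbia", "District of Columbia",
                    "Maryland", "Montgomery County"),
  stringsAsFactors = FALSE
)

# For state-level forecasts, keep only "US" and two-character codes before
# joining, so each name maps to exactly one code.
state_lookup <- locations[locations$location == "US" |
                            nchar(locations$location) == 2, ]
stopifnot(!anyDuplicated(state_lookup$location_name))

dc_code <- state_lookup$location[state_lookup$location_name == "District of Columbia"]
print(dc_code)
```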

stephenturner commented 3 years ago

Looks like there are lots of counties with the same name in different states (Washington, Jefferson, Franklin, no surprise). DC looks like the only non-county dupe.

> focustools:::locations %>% 
+   count(location_name, sort=TRUE) %>% 
+   filter(n>1)
# A tibble: 441 x 2
   location_name         n
   <chr>             <int>
 1 Washington County    31
 2 Jefferson County     26
 3 Franklin County      25
 4 Jackson County       24
 5 Lincoln County       24
 6 Madison County       20
 7 Clay County          18
 8 Montgomery County    18
 9 Union County         18
10 Marion County        17
# … with 431 more rows
> focustools:::locations %>% 
+   count(location_name, sort=TRUE) %>% 
+   filter(n>1) %>% 
+   filter(!grepl("County", location_name))
# A tibble: 1 x 2
  location_name            n
  <chr>                <int>
1 District of Columbia     2

I was worried about eg Hawaii (county) vs Hawaii (state) but no problem there.

vpnagraj commented 3 years ago

heads up i think i have a solution for this. pushing up soon ...

vpnagraj commented 3 years ago

edits pushed up to state-level-ts branch to address the location code issues:

vpnagraj commented 3 years ago

i think we're good with the data prep for the state forecasts. just need to make some decisions about which states/territories to submit (#26) and make some minor edits to the pipeline function (#16)

closing this one for now.

stephenturner commented 3 years ago

This one will probably get reopened from work in #26 if getting quantiles via hilo

stephenturner commented 3 years ago

Actually, handling this in the forecast function so won't have to change this.