data prep to translate target forecasts to submission file format

vpnagraj commented 3 years ago

the COVID-19 forecast hub has strict requirements for the forecast submission format:

https://github.com/reichlab/covid19-forecast-hub/blob/master/data-processed/README.md#forecast-file-format

once we generate point / quantile forecasts for targets we need to execute some post-processing to wrangle the data into the required format

this could include:

pivoting wide quantile predictions to long format
translating week/year to epiweek
getting a target_end_date from week
formatting the "n week ahead" target text (e.g. "1 wk ahead inc death")
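The steps listed above can be sketched in base R. This is only a rough illustration, not the package code: the `wide` data frame, its column names, and the `epiweek_saturday()` helper are all hypothetical, and the target end dates assume the forecast is dated on a Sunday or Monday.

```r
# Hypothetical wide forecast: one row per horizon, one column per quantile.
wide <- data.frame(
  horizon = 1:2,
  point   = c(100, 110),
  q0.025  = c(80, 85),
  q0.975  = c(120, 140)
)

forecast_date <- as.Date("2021-01-04")  # a Monday

# Saturday that ends the epiweek (Sunday to Saturday) containing `date`.
epiweek_saturday <- function(date) {
  wday <- as.POSIXlt(date)$wday  # 0 = Sunday, ..., 6 = Saturday
  date + (6 - wday)
}

# Pivot the quantile columns to long format: one row per horizon and quantile.
qcols <- grep("^q", names(wide), value = TRUE)
long <- do.call(rbind, lapply(qcols, function(col) {
  data.frame(
    horizon  = wide$horizon,
    type     = "quantile",
    quantile = as.numeric(sub("^q", "", col)),
    value    = wide[[col]]
  )
}))
points <- data.frame(horizon = wide$horizon, type = "point",
                     quantile = NA_real_, value = wide$point)
long <- rbind(points, long)

# Add the remaining columns of the submission format.
submission <- data.frame(
  forecast_date   = forecast_date,
  target          = paste(long$horizon, "wk ahead inc death"),
  target_end_date = epiweek_saturday(forecast_date) + 7 * (long$horizon - 1),
  location        = "US",
  type            = long$type,
  quantile        = long$quantile,
  value           = long$value
)
submission
```

The point rows and quantile rows end up stacked in one long table, which is the shape the hub README describes.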

once we have the submission file format prepped, we can validate locally:

vpnagraj commented 3 years ago

NOTE the data prep of forecast output will depend on the forecast method used ... but given that we are likely starting with a time series model (and using the fable framework) then the prep should work for whatever method we land on for initial implementation (ARIMA, ETS, etc)

stephenturner commented 3 years ago

Initial shot at this in 6bbd2fc. Lots of outstanding issues here.

Needs modularity to take different targets and create the output accordingly.
Something's really, really off with dates. It's some combination of epiweeks to date to yearweek and back to date that's getting weird. I can demonstrate or you can run through that last pipeline sticking a %>% tail() in there to take a look at what I mean.

forecast_date	target	target_end_date	location	type	quantile	value
2020-12-22	1 wk ahead inc case	2020-12-12	US	point	NA	1595791
2020-12-22	2 wk ahead inc case	2020-12-19	US	point	NA	1717279
2020-12-22	3 wk ahead inc case	2020-12-26	US	point	NA	1874981
2020-12-22	4 wk ahead inc case	2021-01-02	US	point	NA	1992527
2020-12-22	1 wk ahead inc case	2020-12-12	US	quantile	0.025	1518044
2020-12-22	2 wk ahead inc case	2020-12-19	US	quantile	0.025	1477875
2020-12-22	3 wk ahead inc case	2020-12-26	US	quantile	0.025	1502830
2020-12-22	4 wk ahead inc case	2021-01-02	US	quantile	0.025	1521864
2020-12-22	1 wk ahead inc case	2020-12-12	US	quantile	0.100	1539116
2020-12-22	2 wk ahead inc case	2020-12-19	US	quantile	0.100	1615797
2020-12-22	3 wk ahead inc case	2020-12-26	US	quantile	0.100	1725278
2020-12-22	4 wk ahead inc case	2021-01-02	US	quantile	0.100	1784281
2020-12-22	1 wk ahead inc case	2020-12-12	US	quantile	0.250	1578533
2020-12-22	2 wk ahead inc case	2020-12-19	US	quantile	0.250	1670508
2020-12-22	3 wk ahead inc case	2020-12-26	US	quantile	0.250	1805991
2020-12-22	4 wk ahead inc case	2021-01-02	US	quantile	0.250	1894356
2020-12-22	1 wk ahead inc case	2020-12-12	US	quantile	0.500	1590791
2020-12-22	2 wk ahead inc case	2020-12-19	US	quantile	0.500	1712935
2020-12-22	3 wk ahead inc case	2020-12-26	US	quantile	0.500	1876995
2020-12-22	4 wk ahead inc case	2021-01-02	US	quantile	0.500	1995373
2020-12-22	1 wk ahead inc case	2020-12-12	US	quantile	0.750	1614622
2020-12-22	2 wk ahead inc case	2020-12-19	US	quantile	0.750	1777245
2020-12-22	3 wk ahead inc case	2020-12-26	US	quantile	0.750	1965876
2020-12-22	4 wk ahead inc case	2021-01-02	US	quantile	0.750	2119008
2020-12-22	1 wk ahead inc case	2020-12-12	US	quantile	0.900	1656737
2020-12-22	2 wk ahead inc case	2020-12-19	US	quantile	0.900	1843507
2020-12-22	3 wk ahead inc case	2020-12-26	US	quantile	0.900	2060315
2020-12-22	4 wk ahead inc case	2021-01-02	US	quantile	0.900	2236633
2020-12-22	1 wk ahead inc case	2020-12-12	US	quantile	0.975	1717527
2020-12-22	2 wk ahead inc case	2020-12-19	US	quantile	0.975	1911562
2020-12-22	3 wk ahead inc case	2020-12-26	US	quantile	0.975	2134267
2020-12-22	4 wk ahead inc case	2021-01-02	US	quantile	0.975	2374745

stephenturner commented 3 years ago

I have this working in some code at f2c3e91

Fit separate models for each outcome (inc cases, inc deaths, cum deaths). (I tried fitting multiple models in the same fit objects with different dependent variables, fable complains: you can't have a mable (model table) with different Y vars). So, for now, different model objects.
Pass them to the format_fit_for_submission() function. This produces the forecast at the desired horizon, bootstraps each model fit 1000 times, gets the quibble (quantile tibble) for each fit using 23 quantiles, then restricts down to the smaller subset of quantiles if you're looking at inc cases.
Bind rows from each of these function calls together from each metric you're looking at to create the final submission.

fable-submission-mockup-allmetrics.csv.txt (github doesn't let you upload .csv extensions, remove the .txt)

Notes / known issues:

There's still an issue with the date conversion. This should probably be its own issue together with a reprex.
There's some hard-coding of the text formatting to convert "icases" or "cdeaths" into "inc cases" or "cum deaths", etc, happening around here. This is a gross hack, and would probably be better solved by 👇 or pretty much anything besides this method, which depends on the name of the variable you store the model objects in!
There's some redundancy and creation of separate objects for each outcome. This could probably be cleaned up by creating a single fit object, which is a list, with the names of that list being inc cases, inc deaths, and cum deaths. This would potentially also allow for more strict names(fit) %in% ... checking.
Haven't done this yet with anything lower than US-level data.

@vpnagraj run through this code a pipe at a time, see if you have any suggestions.

vpnagraj commented 3 years ago

stepped through what you have (using the *-allmetrics version of the script)

pushed up some edits:

https://github.com/signaturescience/focustools/commit/344d7122ce6435ff81fcfd4d48e2664c20c2d681

i think i have a candidate fix for the text formatting conversion of "icases" to "inc cases" ... just pass in a new argument for target_name ? seems to be working

also played around with the dates a little bit. agreed that something is way off. i reworked your code, thought i had fixed the issue (to get the epiweek date starting on sunday instead of monday) but now that i'm looking at this issue again it looks the same as your comment above (https://github.com/signaturescience/focustools/issues/6#issuecomment-749668242)

🤔

im wondering if get_cases() and our exclusion of last week (because it's incomplete) is throwing a wrench here ...

vpnagraj commented 3 years ago

@stephenturner FYI looks like get_cases() and get_deaths() did include logic to remove the current week. that same (or similar) logic was implemented in the TS modeling code:

https://github.com/signaturescience/focustools/blob/master/scratch/fable-submission-mockup-allmetrics.R#L29

i think it's better to handle it that way ^ ... ie let's drop the current week exclusion from get_cases() and get_deaths()

done in https://github.com/signaturescience/focustools/commit/4aa7bdb20ec5b3dd392d2ad9f721818ae539f680

so that saved us one week of data. we're still bumping into the issue with horizon being k + 1 week (current week that we can't/shouldn't use in modeling because it is incomplete)

need to keep thinking on this ...

stephenturner commented 3 years ago

I'm still cracking at this. I think the problem comes in with mmwrweeks being converted to dates, then to yearweeks, then back to dates.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Saturday
MMWRweek::MMWRweek("2020-12-26")$MMWRweek
#> [1] 52
# Sunday
MMWRweek::MMWRweek("2020-12-27")$MMWRweek
#> [1] 53
# Monday
MMWRweek::MMWRweek("2020-12-28")$MMWRweek
#> [1] 53

# Sunday
MMWRweek::MMWRweek2Date(2020, 53, MMWRday = 1)
#> [1] "2020-12-27"
MMWRweek::MMWRweek2Date(2020, 53, MMWRday = 1) %>% tsibble::yearweek()
#> <yearweek[1]>
#> [1] "2020 W52"
#> # Week starts on: Monday
MMWRweek::MMWRweek2Date(2020, 53, MMWRday = 1) %>% tsibble::yearweek() %>% lubridate::as_date()
#> [1] "2020-12-21"

Still trying to craft that reprex.

stephenturner commented 3 years ago

I pushed some code in a new script at 6ff70b4. Run through that. I think the ~best~ oversimplified approach here might be simply adding a +6 or -7 or whatever somewhere.

stephenturner commented 3 years ago

I think the date issue is fixed now. I'm creating the tsibble with a function that adds a monday column, which is the monday of that epiweek, and builds the yearweek index column of the tsibble from that monday. Later, after modeling/forecasting, I take the as_date() of that yweek, which returns the monday of that (1, 2, 3, or 4) week ahead forecast, and add days(5) to get the saturday that ends that epiweek.
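A rough base R sketch of that monday-then-saturday arithmetic, without fable or tsibble. The helper names `epiweek_monday()` and `epiweek_saturday()` are made up for illustration and are not the functions in the package.

```r
# Monday that falls inside the epiweek (Sunday to Saturday) containing `date`.
epiweek_monday <- function(date) {
  wday <- as.POSIXlt(date)$wday  # 0 = Sunday, ..., 6 = Saturday
  date - wday + 1
}

# Saturday that ends that same epiweek: the Monday plus five days.
epiweek_saturday <- function(date) epiweek_monday(date) + 5

d <- as.Date(c("2020-12-26", "2020-12-27", "2020-12-28"))  # Sat, Sun, Mon
epiweek_monday(d)
epiweek_saturday(d)
```

Because the Monday sits inside both the Sunday-start epiweek and the Monday-start week, indexing on it should avoid the one-week slip seen when a Sunday date is passed to yearweek().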

From the https://github.com/reichlab/covid19-forecast-hub#ensemble-model section:

For inclusion in the ensemble, we additionally require that forecasts include a full set of 23 quantiles to be submitted for each of the one through four week ahead values for forecasts of deaths, and a full set of 7 quantiles for the one through four week ahead values for forecasts of cases (see technical README for details), and that the 10th quantile of the predictive distribution for a 1 week ahead forecast of cumulative deaths is not below the most recently observed data.

I don't think the current forecasts based on the auto ARIMA models are doing this, but we should probably add a check/correction for this case, that if the 10th quantile of any cumulative forecast is below the most recently observed data, then make it equal to the most recent observed data, at a minimum.

stephenturner commented 3 years ago

The check that forecasts for cumulative deaths are not below the current week's values is now implemented in ae43487. But I haven't yet figured out the best place for this to reside, functionally. format_fit_for_submission() takes as input the model table (output from model()), and doesn't actually take any data as input. The current week's cumulative death value actually resides in the data. If we wrote one monster function that did all of the modeling, forecasting, and formatting, we could do this there, because that function would have to take the data as input, not the models. But I kind of like keeping them separate for now, because it makes tinkering around with the modeling a bit easier, doing it outside of some monster function call. Anyway, for now, the bolt-on fix in ae43487 works, and we can sort out how to best modularize/functionalize this later. @vpnagraj if you wouldn't mind, run through the script https://github.com/signaturescience/focustools/blob/master/scratch/fable-submission-mockup-allmetrics.R to see if this all looks legit to you.

stephenturner commented 3 years ago

I added some code in cfdd1e8e1813dc558215d43e0a764c0340e8a130 to use the script added in 63a2ff968a69e61f8c6b7c7d50f630c0d6d2bb8e to validate the submission.

> forecast_filename <- here::here("scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv")
> validate_file(forecast_filename)

 Validating /Users/sturner/sigsci/irad/focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv ...
VALIDATED: filename 
VALIDATED: column names
VALIDATED: no NA values
Warning in verify_targets(entry) :
  ERROR: Some entries in `targets` do not correspond to standards:1 wk ahead cum deaths, 1 wk ahead inc cases, 1 wk ahead inc deaths, 2 wk ahead cum deaths, 2 wk ahead inc cases, 2 wk ahead inc deaths, 3 wk ahead cum deaths, 3 wk ahead inc cases, 3 wk ahead inc deaths, 4 wk ahead cum deaths, 4 wk ahead inc cases, 4 wk ahead inc deaths
VALIDATED: date format 
VALIDATED: forecast_date, target_end_date
VALIDATED: no quantile crossing
VALIDATED: temporal monotonicity
VALIDATED: cum geq inc
VALIDATED: entries of `quantile`

So, everything seems to look okay except for the targets.

The code https://github.com/reichlab/covid19-forecast-hub/blob/68df08d9e6e19d55fddab4bd5abb505202023ecb/code/validation/R-scripts/functions_plausibility.R#L169-L186, checks for 1, 2, 3, 4 wk ahead inc death and cum deaths, but doesn't allow for inc cases:

#' Checking that all entries in `target` correspond to standards
#'
#' @param entry the data.frame
#'
#' @return invisibly returns TRUE if problems detected, FALSE otherwise
verify_targets <- function(entry){
  allowed_targets <- c(
    paste(0:130, "day ahead inc death"),
    paste(0:130, "day ahead cum death"),
    paste(0:20, "wk ahead inc death"),
    paste(0:20, "wk ahead cum death"),
    paste(0:130, "day ahead inc hosp")
  )
  targets_in_entry <- unique(entry$target)
  if(!all(targets_in_entry %in% allowed_targets)){
    warning("ERROR: Some entries in `targets` do not correspond to standards:",
            paste0(targets_in_entry[!(targets_in_entry %in% allowed_targets)], collapse = ", "))
    return(invisible(FALSE))
  }else{
    cat("VALIDATED: targets\n")
    return(invisible(TRUE))
  }
}

This doesn't jibe with what I thought was required here to be included in the ensemble forecast (https://github.com/reichlab/covid19-forecast-hub/tree/master/data-processed#target). Perhaps this R code is no longer maintained. According to the documentation at https://github.com/reichlab/covid19-forecast-hub/blob/master/data-processed/R_forecast_file_validation.md,

For those familiar with R (but not python), there is a separate set of tests that may be useful to diagnose data formatting issues in functions_plausibility.R. We have tried to keep these in sync with the python checks automatically run during a pull request, but have now stopped maintaining the checks in R. They are kept in the repository merely as an additional resource for teams who work exclusively with R. If you discover major discrepancies, you can nonetheless let us know and we may address them as time permits.

... in fact, after digging around a little bit, it seems like this is the case!

That R script, https://github.com/reichlab/covid19-forecast-hub/blob/master/code/validation/R-scripts/functions_plausibility.R, was last updated in May. According to the README, https://github.com/reichlab/covid19-forecast-hub/tree/master/data-processed#removed-targets, N day ahead inc cases was removed in June.

stephenturner commented 3 years ago

To the script in our utils/ folder, I added N wk ahead inc case to the allowed targets in af56e43.

This let the results pass that validation check (after changing 'deaths' to 'death' and 'cases' to 'case' in 5de29cdd4f9e02240e875a35472f20a0b9239105). But another validation effort failed:

> validate_file(forecast_filename)

 Validating /Users/sturner/sigsci/irad/focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv ...
VALIDATED: filename 
VALIDATED: column names
VALIDATED: no NA values
VALIDATED: targets
VALIDATED: date format 
VALIDATED: forecast_date, target_end_date
 Error in if (any(is_crossing)) { : missing value where TRUE/FALSE needed

I dug into the validation scripts and there's a place right around here https://github.com/reichlab/covid19-forecast-hub/blob/68df08d9e6e19d55fddab4bd5abb505202023ecb/code/validation/R-scripts/functions_plausibility.R#L259-L282 where it checks for "quantile crossing". I'm not exactly sure what this is doing yet, but I think what's causing a problem here is that some targets have different quantiles required than others. inc deaths and cum deaths require a larger set of quantiles, while N wk ahead inc case (the newly added target) requires only a subset of those quantiles. This is spelled out in the data submission readme here.

I think this causes a problem with this old legacy code because one of the operations it performs is a widening reshape, and when there are some targets with a subset of quantiles compared to other targets, you end up with NAs in the wide matrix. I still don't fully understand what this check is looking for, but I silenced this validation problem in a2cd111 by omitting NAs from this crossing check. All the others were FALSE. This obviates the Error seen above, and all validation checks pass.
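For reference, "quantile crossing" usually means that, within one target and location, the predicted value goes down as the quantile level goes up. A rough sketch of a check that works on the long format group by group, so targets with different quantile sets never produce NAs. The data and the `has_crossing()` helper are hypothetical, and this is not the hub's implementation.

```r
# Hypothetical long-format forecasts: two targets with different quantile sets.
fc <- data.frame(
  target   = c(rep("1 wk ahead inc case", 3), rep("1 wk ahead inc death", 5)),
  location = "US",
  quantile = c(0.025, 0.5, 0.975, 0.01, 0.025, 0.5, 0.975, 0.99),
  value    = c(90, 100, 120, 8, 9, 10, 12, 11)  # the last death value crosses
)

# TRUE if the value decreases anywhere as the quantile level increases.
has_crossing <- function(quantile, value) {
  any(diff(value[order(quantile)]) < 0)
}

# Check each target/location group separately, with no widening reshape.
groups <- split(fc, list(fc$target, fc$location), drop = TRUE)
crossing <- vapply(groups, function(g) has_crossing(g$quantile, g$value), logical(1))
crossing
```

Here the inc case group passes and the inc death group is flagged, because its 0.99 quantile (11) is below its 0.975 quantile (12).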

> validate_file(forecast_filename)

 Validating /Users/sturner/sigsci/irad/focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv ...
VALIDATED: filename 
VALIDATED: column names
VALIDATED: no NA values
VALIDATED: targets
VALIDATED: date format 
VALIDATED: forecast_date, target_end_date
VALIDATED: no quantile crossing
VALIDATED: temporal monotonicity
VALIDATED: cum geq inc
VALIDATED: entries of `quantile`

CAVEAT: This works, but given the hacks I had to put into place to get this working, I'd recommend we either:

Switch to the officially supported instructions for validating locally, https://github.com/reichlab/covid19-forecast-hub/wiki/Running-Checks-Locally
Or else look around to see if someone else has forked and kept this R code up to date.

If we can find #-2 above, it sure would be more lightweight than going the #-1 route, which requires updating the upstream of the fork, installing some python pkgs, etc. Perhaps it isn't as burdensome as I think. I'll give it a spin on darwin if I can before our meeting today.

stephenturner commented 3 years ago

Follow up -- #-1 is pretty trivial. I set up a new conda environment, and followed the instructions at https://github.com/reichlab/covid19-forecast-hub/wiki/Running-Checks-Locally to install requirements and validate a single forecast file.

On darwin:

(focus) sturner@darwin:/data/projects/focus/covid19-forecast-hub$ python3 code/validation/validate_single_forecast_file.py ../focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv

VALIDATING ../focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv
✓ ../focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv is valid with no errors

🎉 🥳 🌟 ✔️

vpnagraj commented 3 years ago

@stephenturner parallel thought here ...

what if we put the python validation script / pkgs in a docker image ... and wrapped a call to that docker image in an R function (i.e. using something like stevedore) ?

i can help with that if we want to pursue. shouldn't be too big of a lift. BUT we'd obviously still need to make sure that validation code stays current

stephenturner commented 3 years ago

I'd almost always prefer to call an R function than issue a python command/script at the bash shell. Looks like the requirements are pretty minimal.

https://github.com/reichlab/covid19-forecast-hub/blob/master/visualization/requirements.txt

vpnagraj commented 3 years ago

agreed. see https://github.com/signaturescience/focustools/issues/9

vpnagraj commented 3 years ago

@stephenturner heads up i've heavily refactored the scratch submission mockup code:

https://github.com/signaturescience/focustools/blob/master/scratch/fable-submission-mockup-allmetrics.R

things to note:

the new ts_forecast() function now sits outside of format_fit_for_submission() (i think we can be a little more nimble this way)
ts_forecast() accepts horizon AND "new_data" args ... new_data is what fable needs for forecasts that require other covariates ... if NULL (default) then the new_data will be ignored. the way its written now, ts_forecast() should work for either forecasting with/without new_data
i added a "seed" argument to ts_forecast() so that i could validate that the forecasts matched what you were generating previously (checked before i changed the ideaths to be predicted by lagged cases). probably a good idea to keep that in there
cdeaths is still being forecast using an ARIMA ... so we still need to figure out a way to use the ideaths forecast to arrive at cdeaths (AND still get the quibble format)
i ran into an issue with validating the submission file generated with the current date (2021-01-05) ... see message below. the workaround was to force the forecast date and filename to use yesterday (2021-01-04), after which validation succeeded.

VALIDATING ../focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-05-sigsci-ts.csv
✘ Error in ../focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-05-sigsci-ts.csv. Error(s):
 ["target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'point', 'NA', '369373']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.01', '367031']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.025', '367118']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.05', '367358']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.1', '367969']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.15', '368250']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.2', '368334']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.25', '368688']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. 
exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.3', '369017']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.35', '369038']", 'target_end_date was ...']

any thoughts on that ^ ?

stephenturner commented 3 years ago

I don't know, unless it expected the target end date for 1 week ahead to end on the following saturday if you're dating the forecast after monday? I feel like I've seen something to this effect in the docs. Let me dig.

vpnagraj commented 3 years ago

sheesh.

well maybe thats OK? i mean im working on writing the validation wrapper for the python method now. we can stick to validating only before we are ready to submit on the sunday or monday. so as long as we generate the forecasts/validate on sunday or monday (before deadline) it should be fine? i think?
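A sketch of the rule the validator messages seem to imply: for a forecast dated Sunday or Monday, the 1 wk ahead target ends on the Saturday of that same epiweek, and from Tuesday onward it ends on the Saturday after that. This is inferred from the output above, not copied from the hub code, and `expected_target_end_date()` is a hypothetical helper.

```r
expected_target_end_date <- function(forecast_date, horizon) {
  wday <- as.POSIXlt(forecast_date)$wday  # 0 = Sunday, ..., 6 = Saturday
  this_saturday <- forecast_date + (6 - wday)
  # Sunday or Monday: week 1 ends this Saturday; otherwise it ends the next one.
  shift <- ifelse(wday <= 1, 0, 7)
  this_saturday + shift + 7 * (horizon - 1)
}

expected_target_end_date(as.Date("2021-01-04"), 1)  # Monday
expected_target_end_date(as.Date("2021-01-05"), 1)  # Tuesday
```

With the Monday date this gives 2021-01-09, and with the Tuesday date it gives 2021-01-16, which lines up with the exp_target_end_date in the error message.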

stephenturner commented 3 years ago

See #26. Reopening because there's currently a line hard-coding "US" as the location:

https://github.com/signaturescience/focustools/blob/e600847353be71b53886689b3c3af147bc247d97/R/submission.R#L72

dplyr::mutate(location="US", forecast_date=lubridate::today())

This will not allow for state or county-level granularity.

vpnagraj commented 3 years ago

@stephenturner see https://github.com/signaturescience/focustools/blob/state-level-ts/R/submission.R#L72

i removed the location="US" that was hardcoded in there. the forecast object should include a location column generated with get_cases() / get_deaths():

granularity="national": the value will be "US"
granularity="state": the value will be full state name
granularity="county": the value will be county fips code

we do need to convert the state/territory name to appropriate FIPS:

https://github.com/signaturescience/focustools/blob/state-level-ts/R/submission.R#L72

i think that will be a simple join to focustools:::locations somewhere in format_for_submission() ?

vpnagraj commented 3 years ago

sorry to steamroll you here @stephenturner but i'm cooking on this state level stuff!

i just pushed up an edit to format_for_submission() that addresses the location join

that piece seems to be working now. mostly.

i'm seeing the following issues in validate_forecast() (full output at bottom of this comment):

"entries in the value column must be non-negative": these are state/territory forecasts that come through as negative. my guess is that most of the negative values are from models of territories where there are few cases (for example, location '78' is Virgin Islands and is one that has negative values predicted). so it's really an issue with models themselves, not necessarily formatting (#26 ). unless we want to stick a condition in format_for_submission() that bounds all values at min 0? i kind of think that should go elsewhere ...
"invalid location for target. location='11001'": that's the location code for DC. we need to figure out what the correct code should be
"target_end_date was not the expected Saturday." : that's because i'm running on a tuesday ...

[1] "entries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '1 wk ahead inc death', '2021-01-23', '02', 'quantile', '0.025', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '1 wk ahead inc death', '2021-01-23', '66', 'quantile', '0.05', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '1 wk ahead inc death', '2021-01-23', '78', 'quantile', '0.01', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '1 wk ahead inc death', '2021-01-23', '78', 'quantile', '0.025', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '1 wk ahead inc death', '2021-01-23', '78', 'quantile', '0.05', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '2 wk ahead inc death', '2021-01-30', '02', 'quantile', '0.025', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '2 wk ahead inc death', '2021-01-30', '66', 'quantile', '0.05', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '2 wk ahead inc death', '2021-01-30', '78', 'quantile', '0.01', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '2 wk ahead inc death', '2021-01-30', '78', 'quantile', '0.025', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '3 wk ahead inc death', '2021-02-06', '02', 'quantile', '0.025', '-1']\nentries in the `valu...\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'point', 'NA', '1426']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.01', '877']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. 
row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.025', '881']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.05', '885']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.1', '887']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.15', '888']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.2', '889']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.25', '890']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.3', '890']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.35', '892']\ninvalid location for...\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'point', 'NA', '6533']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.01', '6249']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. 
exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.025', '6249']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.05', '6391']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.1', '6401']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.15', '6408']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.2', '6432']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.25', '6445']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.3', '6446']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.35', '6446']\ntarget_end_date was ..."

stephenturner commented 3 years ago

unless we want to stick a condition in format_for_submission() that bounds all values at min 0? i kind of think that should go elsewhere ...

Bound it at zero for now. We could get more sophisticated... for cum deaths we would bound point and all quantiles at no less than the last week's current data. Inc death/cases: seems reasonable that the +1wk ahead should be no less than 2x the difference between 0 and -1wk. Or +2wk ahead should be no less than 2x the difference between 0 and -2wk. And still bounded at zero. I.e., enforcing that you can't drop incident cases/deaths more than twice as much as they changed in a previous horizon backward?

Where to do it? Agree doesn't really belong in a formatting script. But the ts_forecast doesn't yet track the data (#17), so you'd have to supply that as an arg there. Perhaps some final thing after formatting for submission, something like bound_submission(submission, data)? Although that could get tricky with submissions with multiple location granularities (eg from a bind_rows on a US level forecast with a state-level forecast) from different data objects with different location granularity?
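A minimal sketch of what a bound_submission(submission, data) could look like, covering only the zero floor and the cumulative death floor (not the 2x idea). The column names `location`, `target`, `value`, and `cdeaths` are assumptions, and `data` is assumed to hold one row per location with the latest observed cumulative deaths. Matching on location is one way to cope with a submission that mixes granularities.

```r
bound_submission <- function(submission, data) {
  # Floor every value at zero.
  submission$value <- pmax(submission$value, 0)
  # Floor cumulative deaths at the last observed value for the same location.
  floor_cum <- data$cdeaths[match(submission$location, data$location)]
  is_cum <- grepl("cum death", submission$target, fixed = TRUE) & !is.na(floor_cum)
  submission$value[is_cum] <- pmax(submission$value[is_cum], floor_cum[is_cum])
  submission
}

# Hypothetical submission rows mixing a territory and the US.
submission <- data.frame(
  target   = c("1 wk ahead inc death", "1 wk ahead cum death", "1 wk ahead cum death"),
  location = c("78", "78", "US"),
  value    = c(-1, 20, 370000)
)
# Hypothetical latest observed cumulative deaths, one row per location.
data <- data.frame(location = c("78", "US"), cdeaths = c(23, 365000))

bounded <- bound_submission(submission, data)
bounded$value
```

Rows whose location has no match in `data` only get the zero floor, so binding together forecasts built from different data objects would not error.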

stephenturner commented 3 years ago

"invalid location for target. location='11001'": that's the location code for DC. we need to figure out what the correct code should be

"District of Columbia" is 11 right

https://github.com/signaturescience/focustools/blob/230e2bc88bc7c4d969d60db609a4a528b0a61f4a/data-raw/locations.csv#L11

vpnagraj commented 3 years ago

ahh DC is both:

https://github.com/signaturescience/focustools/blob/230e2bc88bc7c4d969d60db609a4a528b0a61f4a/data-raw/locations.csv#L379

11001 must be the county FIPS

need to make a special case to handle that somehow

stephenturner commented 3 years ago

Looks like there are lots of counties with the same name in different states (Washington, Jefferson, Franklin, no surprise). DC looks like the only non-county dupe.

> focustools:::locations %>% 
+   count(location_name, sort=TRUE) %>% 
+   filter(n>1)
# A tibble: 441 x 2
   location_name         n
   <chr>             <int>
 1 Washington County    31
 2 Jefferson County     26
 3 Franklin County      25
 4 Jackson County       24
 5 Lincoln County       24
 6 Madison County       20
 7 Clay County          18
 8 Montgomery County    18
 9 Union County         18
10 Marion County        17
# … with 431 more rows
> focustools:::locations %>% 
+   count(location_name, sort=TRUE) %>% 
+   filter(n>1) %>% 
+   filter(!grepl("County", location_name))
# A tibble: 1 x 2
  location_name            n
  <chr>                <int>
1 District of Columbia     2

I was worried about eg Hawaii (county) vs Hawaii (state) but no problem there.

vpnagraj commented 3 years ago

heads up i think i have a solution for this. pushing up soon ...

vpnagraj commented 3 years ago

edits pushed up to state-level-ts branch to address the location code issues:

counties with same name in different states shouldn't be an issue (since we use their FIPS not location_name)
i moved the code to join to the locations object into the get_ functions and am only joining for states/territories
for the DC fips issue ... i'm just removing the DC fips from the locations in generate_sysdata.R (see https://github.com/signaturescience/focustools/blob/state-level-ts/data-raw/generate_sysdata.R#L6-L9)
i also added a str_pad to make sure county FIPS are 5 digit
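For anyone following along, a small base R sketch of the two fixes above. The `locations` columns and values here are assumptions for illustration; the package itself uses stringr::str_pad and its own locations object.

```r
# Hypothetical county FIPS codes that lost their leading zero.
fips <- c("1001", "11001", "6037")
# Pad to five digits (same idea as stringr::str_pad(fips, width = 5, pad = "0")).
fips_padded <- sprintf("%05d", as.integer(fips))
fips_padded

# Hypothetical lookup table where DC appears as both a state and a county.
locations <- data.frame(
  location      = c("11", "11001", "78"),
  location_name = c("District of Columbia", "District of Columbia", "Virgin Islands"),
  stringsAsFactors = FALSE
)
# Drop the DC county code so a join on location_name returns a single row.
state_lookup <- locations[locations$location != "11001", ]
dc_code <- state_lookup$location[match("District of Columbia", state_lookup$location_name)]
dc_code
```

After dropping the county row, the name lookup for DC resolves to "11" only.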

vpnagraj commented 3 years ago

i think we're good with the data prep for the state forecasts. just need to make some decisions about which states/territories to submit (#26) and make some minor edits to the pipeline function (#16 )

closing this one for now.

stephenturner commented 3 years ago

This one will probably get reopened from work in #26 if getting quantiles via hilo

stephenturner commented 3 years ago

Actually, handling this in the forecast function so won't have to change this.
