signaturescience / focustools

Forecasting COVID-19 in the US
https://signaturescience.github.io/focustools/
GNU General Public License v3.0
0 stars 0 forks source link

Track data used across forecasting #17

Closed stephenturner closed 3 years ago

stephenturner commented 3 years ago

fable::model and focustools::ts_fit both take data from make_tsibble as input. ts_sumulative_forecast requires the incident forecast, as well as the data used to create that incident forecast. There's a check in there that the first forecasted week is 1 week ahead of the last week of recorded data in the supplied dataset, and this should catch egregious errors supplying the wrong data that wasn't used to generate the inc forecast. This could be done with something like this in the ts_fit function

attr(icases_forecast, ".data") <- deparse_substitute(.data)

Then later something like this in the ts_sumulative_forecast

stopifnot(deparse(substitute(.data))==attr(inc_forecast, ".data"))

Doesn't seem to change the tibble that's output. Example:

return_data_with_attr <- function(.data) {
  out <- tibble::as_tibble(head(.data))
  attr(out, ".data") <- deparse(substitute(.data))
  return(out)
}
> irishead <- return_data_with_attr(iris)
> irishead
# A tibble: 6 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
4          4.6         3.1          1.5         0.2 setosa 
5          5           3.6          1.4         0.2 setosa 
6          5.4         3.9          1.7         0.4 setosa 
> attr(irishead, ".data")
[1] "iris"
stephenturner commented 3 years ago

So it seems like the best place to put this would be in ts_fit but you can't use the TSLM with lag 3 (according to the readme, I still haven't dived in yet to grok why). As of now, both in the walkthrough in the readme and in the pipeline function, we're calling model() directly. If we were calling model() inside a function call in the package we could tack on this attribute easily, and this attribute could be picked up in ts_forecast. Any thoughts on an immediate solution to this one @vpnagraj or should we keep this one open for now and not tackle it just yet? It's a potential gotcha that probably won't bite us now, but may as we develop subnational forecasts with different data objects.

vpnagraj commented 3 years ago

so unless i'm missing it, the code above just checks the name of the object, right? it's not actually checking equivalent values in the data? to do that i guess we could make ts_forecast()return a list with the forecast tibble and the data as elements. slippery slope until we're basically writing our own s3 class for this stuff.

to be honest, thinking about this now i'm not sure how dangerous this is. even in the state-level forecasts, how would we get into a situation where we were making cumulative forecast from different data than we used for the incident forecast? especially with the one-stop forecast_pipeline()? i mean the possibility of that sort of goof will always be there interactively. but i'm having a hard time realistically imaging this biting us, especially as we automate more and more.

still bugs me. let's leave the issue open. maybe revisit after we get https://github.com/signaturescience/focustools/issues/26 working? and if we ever publish the pkg then i definitely want to address this.

stephenturner commented 3 years ago

Yep, just the name. Perhaps an in-between just storing the name versus storing the actual data in a list would be to store a hash eg with digest of the data tibble, and ensure the hashes match.

return_data_with_attr <- function(.data) {
  out <- tibble::as_tibble(head(.data))
  attr(out, ".data") <- digest::digest(.data)
  return(out)
}

irishead <- return_data_with_attr(iris)
attr(irishead, ".data")
#> [1] "d3c5d071001b61a9f6131d3004fd0988"

But, I think I'm in agreement with you all around. (A) this isn't that dangerous now, but (B) when we make public or try to publish, this wouldn't be terribly hard to implement I don't think.

Punt to post #26

vpnagraj commented 3 years ago

going to close this one out. havent run into any issues losing track of data between forecasts yet. function docs do provide a reminder that data should match:

https://github.com/signaturescience/focustools/blob/master/R/ts.R#L158

if needed we can reopen later