elray1 opened 7 years ago
Proposal: we continue to ignore backfill for the ensemble comparison (flusight-test), but for all CDC-related projects we fit the models on revised data and feed unrevised data in to create predictions.
My sense is that the key place to substitute the unrevised data is when we first load the data for the `get_log_scores_via_trajectory_simulation` function. Although the data are only loaded that early so that we "know what the dimensions of the results data frame should be" (according to code comments), so it might make sense to leave that intact and then read the data in within each iteration of the loop over `analysis_time_season_week`.
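A minimal sketch of that per-iteration approach, in Python for illustration (the project code is R, but the filtering pattern is the same). The column names `issue_week` and `season_week` are hypothetical stand-ins for whatever versioning fields the data actually carry:

```python
import pandas as pd

def data_as_of(df: pd.DataFrame, analysis_time_season_week: int) -> pd.DataFrame:
    """Return only the observations that had been reported by the given
    analysis time, i.e. "everything observed up to analysis_time_season_week".

    Assumes each row records the week it describes (season_week) and the
    week it was first reported (issue_week) -- hypothetical column names.
    """
    return df[df["issue_week"] <= analysis_time_season_week]

# Inside the loop, each prediction would then see only the data
# available at that point in time:
#
# for analysis_time_season_week in analysis_times:
#     current_data = data_as_of(full_data, analysis_time_season_week)
#     ... fit/predict using current_data ...
```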
I think your second suggestion, reading the data within each iteration of the loop over `analysis_time_season_week`, makes sense. That's because when we do prediction at each `analysis_time_season_week`, we need to use a data set that is "everything that was observed up to `analysis_time_season_week`". Ideally, I think it would be good to have functionality going forward that can handle either prediction using final observed data (as we have done in the past) or using just the data that were available by the `analysis_time_season_week` (as we're doing for this project). I guess that could be handled by adding an argument to the `get_log_scores_via_trajectory_simulation` function specifying the data set (rather than hard-coding a path), and another argument specifying what type of data set we're giving it?
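One way that two-argument interface could look, sketched in Python (the actual function is R, and the argument names and `"final"`/`"unrevised"` values here are hypothetical): the caller passes the data set directly plus a flag saying whether it is final revised data or real-time unrevised data, and the function restricts the data per analysis time only in the latter case.

```python
import pandas as pd

def get_log_scores_via_trajectory_simulation(
    data: pd.DataFrame,
    data_type: str,  # "final" or "unrevised" -- hypothetical values
    analysis_time_season_weeks: list,
):
    """Sketch: take the data and its type as arguments instead of
    hard-coding a data path inside the function."""
    if data_type not in {"final", "unrevised"}:
        raise ValueError("data_type must be 'final' or 'unrevised'")
    results = {}
    for analysis_time_season_week in analysis_time_season_weeks:
        if data_type == "unrevised":
            # only observations that were reported by this analysis time
            current_data = data[data["issue_week"] <= analysis_time_season_week]
        else:
            # final revised data: every row is visible at every analysis time
            current_data = data
        # placeholder for the real trajectory-simulation scoring step;
        # here we just record how many rows each prediction would see
        results[analysis_time_season_week] = len(current_data)
    return results
```

The same loop body then serves both retrospective evaluation on final data and the prospective-prediction setting, with the branch on `data_type` as the only difference.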
At least for prospective-prediction years, predictions only use data actually observed by the prediction date.