signaturescience / focustools

Forecasting COVID-19 in the US
https://signaturescience.github.io/focustools/
GNU General Public License v3.0

exploratory count regression code and fix for negative county-level incidence #38

Closed vpnagraj closed 3 years ago

vpnagraj commented 3 years ago

putting this PR up mostly to get the fix for negative county-level incidence into master

that issue is described in detail at #31

the fix was implemented in get_cases() and get_deaths() before aggregating at county/state/national ...

what that means is that this could (and i think does!) affect the national level forecasts. in fact when i looked at this today, the national forecast was on average ~5% different from what we submitted. generally estimates were higher.

the reason? i think the negative counts are from counties adjusting their cumulative counts (again described in detail and with an example at #31). when we sum up the incident counts we are including those negative values ... maybe there are enough of them that it leads to underestimated counts => lower forecasts? maybe contributing to #29?
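to make the arithmetic concrete, here is a minimal sketch in python (not the package's R code). the cumulative series is made up and `daily_incidence` is a hypothetical helper, not something in focustools. it only shows how a downward revision of a cumulative count turns into a negative incident count, and how much the total shifts if negatives are forced to 0.

```python
# hypothetical cumulative counts for one county; the fourth value is a downward revision
cumulative = [100, 110, 125, 118, 130, 142, 150]


def daily_incidence(cum):
    """Difference consecutive cumulative counts to get daily incident counts."""
    return [later - earlier for earlier, later in zip(cum, cum[1:])]


incidence = daily_incidence(cumulative)

# summing the raw differences keeps the negative value in the total
raw_total = sum(incidence)

# forcing negatives to 0 before summing drops the revision from the total
bounded_total = sum(max(x, 0) for x in incidence)

print(incidence)      # [10, 15, -7, 12, 12, 8]
print(raw_total)      # 50
print(bounded_total)  # 57
```

in this toy series the raw total matches the net change in the cumulative count (150 - 100), and the bounded total is higher by exactly the size of the revision. whether the raw or bounded number is closer to the truth depends on what the revision actually represents, which the sketch can't tell us.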

@stephenturner i could use your brain here. assigned you for review. main thing i want to confirm is yes/no we should be forcing the negative counts to 0. i think we should ...

after you have a chance to look at it (relevant code in https://github.com/signaturescience/focustools/commit/b6576c26a6717d584e01ec39cd1d5acf6564c18a) go ahead and merge and leave the count-reg branch open. i'll continue to update my count regression exploration there (scratch/count-regression.R)

vpnagraj commented 3 years ago

so there is clearly something going on with the day-to-day counts reported in some locations. some places seem to "correct" counts from one day to another.

one thing i hadn't really acknowledged yet was the fact that we are only interested in weekly counts here.

in other words, the cumsum is happening after we group by week and sum up incident cases. so maybe some of the variability day-to-day shouldn't be removed? meaning, the correction for negative counts should only happen after we've grouped by and cumsum'd weeks? another thought is that since this is really only a "breaking" issue with county level forecasts maybe we should just correct those and leave the state/national behavior as-is ... at least until we are 100% sure that the correction is appropriate.
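a rough sketch of the two orderings, in python rather than the package's R, with made-up daily counts. `bound_then_sum` and `sum_then_bound` are hypothetical names for illustration only and are not how get_cases() / get_deaths() are written.

```python
# hypothetical daily incident counts, one inner list per week
daily_by_week = [
    [5, 8, -3, 6, 7, 4, 9],   # week 1: one small downward correction
    [2, -10, 1, 0, 3, 1, 0],  # week 2: correction larger than the week's new counts
]


def bound_then_sum(weeks):
    """Force each daily count to be >= 0, then sum within each week."""
    return [sum(max(x, 0) for x in week) for week in weeks]


def sum_then_bound(weeks):
    """Sum daily counts within each week, then force each weekly total to be >= 0."""
    return [max(sum(week), 0) for week in weeks]


print(bound_then_sum(daily_by_week))  # [39, 7]
print(sum_then_bound(daily_by_week))  # [36, 0]
```

with these numbers, bounding after the weekly sum leaves week 1 at its net value and only touches week 2, where the weekly total itself is negative. bounding by day changes both weeks.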

stephenturner commented 3 years ago

Good points. I'd assume the variability even at county level would be stabilized after summing by week. Correcting after summarizing by epiweek might still inflate incidence, but it wouldn't be nearly as bad as setting to zero by day and then summing. So I think it's safe to do this, especially at state and US levels. Might be worth a look at cumulative cases (not zeroing) versus cumulative cases (zeroing).
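One way that comparison could be sketched, in Python for illustration and with made-up weekly incident counts (none of these values come from the real data):

```python
from itertools import accumulate

# hypothetical weekly incident counts; the second week nets out negative
weekly_incidence = [36, -3, 20, 15]

# cumulative counts without zeroing
cumulative_raw = list(accumulate(weekly_incidence))

# cumulative counts after forcing negative weeks to 0
cumulative_zeroed = list(accumulate(max(w, 0) for w in weekly_incidence))

# per-week gap between the two cumulative series
gap = [z - r for r, z in zip(cumulative_raw, cumulative_zeroed)]

print(cumulative_raw)     # [36, 33, 53, 68]
print(cumulative_zeroed)  # [36, 36, 56, 71]
print(gap)                # [0, 3, 3, 3]
```

The gap never shrinks once a negative week has been zeroed, so plotting it over time would show how much the zeroing inflates the cumulative series at each geographic level.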

vpnagraj commented 3 years ago

cool. i think i'm going to call it and merge this. what i did was move that "bounding to zero" stuff to the end of get_cases() and get_deaths() ... meaning if for some reason any of the icases/ideaths are negative at the weekly aggregate then they are forced to zero. we were seeing this for the charlottesville ideaths (see #31)! not any more.

anyways, this change (having the bound happen at the end versus the beginning of the get_cases() / get_deaths() data retrieval) might seem minor ... but it looks like it does influence the forecasts at national level. i think this is the most appropriate way to manage it.

it would be really hard to disentangle some of the day-to-day reporting inconsistency. at minimum we should never be returning weekly incidence that's negative. and this makes sure that won't happen.

ok. onto the next challenge!