signaturescience / focustools

Forecasting COVID-19 in the US
https://signaturescience.github.io/focustools/
GNU General Public License v3.0

county-level incident counts are negative #31

Closed vpnagraj closed 3 years ago

vpnagraj commented 3 years ago

when we pull data from JHU or NYT we start with cumulative data and untangle that to calculate incident counts

looks like on the county level (and probably on the state/national scale too) the cumulative counts are sometimes revised downward, so a later report shows fewer cases or deaths than the one before it

the result is that you can have negative incident counts. i noticed this at the county level:

library(dplyr)  ## provides %>%, filter(), between() and select() used below

dat <- readr::read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv")

ind <- which(names(dat) == "1/22/20")

dat %>%
  tidyr::gather(date, count, dplyr::all_of(ind:ncol(dat))) %>%
  ## drop unnecessary columns
  dplyr::select(-iso2,-code3,-Country_Region) %>%
  dplyr::mutate(date = as.Date(date, format = "%m/%d/%y")) %>%
  dplyr::mutate(epiyear=lubridate::epiyear(date), .after=date) %>%
  dplyr::mutate(epiweek=lubridate::epiweek(date), .after=epiyear) %>%
  dplyr::rename(county = Admin2, fips = FIPS, state = Province_State) %>%
  dplyr::group_by(county, fips, state) %>%
  dplyr::arrange(date) %>%
  ## coerce from cumulative to incident deaths
  ## hold onto count as "cdeaths" for cumulative deaths
  dplyr::mutate(ideaths = count - dplyr::lag(count, default = 0L),
                cdeaths = count) %>%
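  ## keep Charlottesville, VA only
  dplyr::filter(fips == "51540") %>%
  ## assumption: the rows shown below are from 2020; without a year filter,
  ## weeks 33-35 of later years in the file would match as well
  dplyr::filter(epiyear == 2020, dplyr::between(epiweek, 33, 35)) %>%
  ## grouping columns (county, state) are kept automatically by select()
  dplyr::select(fips, epiweek, ideaths, cdeaths)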
county           state     fips   epiweek  ideaths  cdeaths
Charlottesville  Virginia  51540  33       0        15
Charlottesville  Virginia  51540  33       0        15
Charlottesville  Virginia  51540  33       0        15
Charlottesville  Virginia  51540  33       0        15
Charlottesville  Virginia  51540  33       0        15
Charlottesville  Virginia  51540  33       0        15
Charlottesville  Virginia  51540  33       0        15
Charlottesville  Virginia  51540  34       0        15
Charlottesville  Virginia  51540  34       0        15
Charlottesville  Virginia  51540  34       0        15
Charlottesville  Virginia  51540  34       -1       14
Charlottesville  Virginia  51540  34       0        14
Charlottesville  Virginia  51540  34       0        14
Charlottesville  Virginia  51540  34       0        14
Charlottesville  Virginia  51540  35       2        16
Charlottesville  Virginia  51540  35       0        16
Charlottesville  Virginia  51540  35       1        17
Charlottesville  Virginia  51540  35       0        17
Charlottesville  Virginia  51540  35       0        17
Charlottesville  Virginia  51540  35       0        17
Charlottesville  Virginia  51540  35       1        18

best way to handle this? probably in the get_ functions. we can bound ideaths/icases at 0 there. but i'm not sure that we should change cdeaths/ccases, given that those are the reports. it certainly doesn't make sense to count incident cases as negative!
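a minimal sketch of the bounding idea, not the actual get_ implementation. the toy vector mirrors the Charlottesville cdeaths values above (the tail of epiweek 34 into epiweek 35), and the names `toy`, `bounded` and `ideaths_raw` are made up for illustration. the first difference is set to 0 here just to keep the toy series short:

```r
library(dplyr)

## toy cumulative series: 15 -> 14 is the downward revision, 14 -> 16 the next report
toy <- tibble::tibble(
  fips    = "51540",
  cdeaths = c(15, 15, 15, 14, 14, 14, 14, 16)
)

bounded <- toy %>%
  group_by(fips) %>%
  mutate(
    ## raw day-over-day difference; this is where the -1 shows up
    ideaths_raw = cdeaths - lag(cdeaths, default = first(cdeaths)),
    ## bound incident counts at 0 and leave cdeaths exactly as reported
    ideaths = pmax(ideaths_raw, 0)
  ) %>%
  ungroup()

bounded
```

one consequence worth keeping in mind: after bounding, the incident counts no longer sum to the change in the cumulative counts. in the toy series the bounded ideaths sum to 2, while cdeaths only goes from 15 to 16.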

stephenturner commented 3 years ago

I agree, solve it in get_deaths, bound to zero there. Hopefully the fluctuations and reporting corrections are very minor and shouldn't have much impact. Only thing here... when we're getting down to county level granularity we're definitely going to run into the problem you noted before somewhere else, that these time series models weren't designed for count data. https://otexts.com/fpp3/counts.html suggests this is no problem once the counts are far enough away from zero, but they won't be at the county level. They might not be at the state level for deaths either, if/when things start to wind down again.

vpnagraj commented 3 years ago

good points. but yeah to clarify, i'm not suggesting we use the TS models at the county level.

i ran into this while playing around with trendeval package and count regression over here https://github.com/signaturescience/focustools/commit/bebb1c23e4c5f7c686af04d9e79eaf8852f45d5c

related to https://github.com/signaturescience/focustools/issues/27
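for context on why negative incident counts matter for count regression, here is a sketch using base R's `glm()` with a Poisson family. this is an assumption about how the problem can surface, not a reproduction of the linked commit or of anything trendeval does internally. the `weekly` data frame is made up, with the -1 standing in for a downward revision:

```r
## made-up weekly incident counts for illustration only
weekly <- data.frame(
  week    = 1:8,
  ideaths = c(0, 1, 0, -1, 2, 1, 3, 2)
)

## the Poisson family rejects negative responses, so this fit fails;
## tryCatch() captures the error instead of stopping the script
raw_fit <- tryCatch(
  glm(ideaths ~ week, family = poisson(), data = weekly),
  error = function(e) e
)
inherits(raw_fit, "error")

## after bounding at zero the same model can be fit
weekly$ideaths_bounded <- pmax(weekly$ideaths, 0)
bounded_fit <- glm(ideaths_bounded ~ week, family = poisson(), data = weekly)
coef(bounded_fit)
```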

vpnagraj commented 3 years ago

resolved via #38