county-level incident counts are negative

vpnagraj commented 3 years ago

when we pull data from JHU or NYT we start with cumulative data and untangle that to calculate incident counts

looks like at the county level (and probably at the state/national scale too) the cumulative counts are sometimes revised downward, so a later report shows fewer cumulative cases or deaths than the one before it

the result is that you can have negative incident counts. i noticed this at the county level:

library(magrittr) ## provides %>%

dat <- readr::read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv")

## position of the first date column; everything from here on is a daily cumulative count
ind <- which(names(dat) == "1/22/20")

dat %>%
  ## reshape from one column per date to one row per county per date
  tidyr::gather(date, count, dplyr::all_of(ind:ncol(dat))) %>%
  ## drop unnecessary columns
  dplyr::select(-iso2,-code3,-Country_Region) %>%
  dplyr::mutate(date = as.Date(date, format = "%m/%d/%y")) %>%
  dplyr::mutate(epiyear=lubridate::epiyear(date), .after=date) %>%
  dplyr::mutate(epiweek=lubridate::epiweek(date), .after=epiyear) %>%
  dplyr::rename(county = Admin2, fips = FIPS, state = Province_State) %>%
  dplyr::group_by(county, fips, state) %>%
  dplyr::arrange(date) %>%
  ## coerce from cumulative to incident deaths
  ## hold onto count as "cdeaths" for cumulative deaths
  dplyr::mutate(ideaths = count - dplyr::lag(count, default = 0L),
                cdeaths = count) %>%
  dplyr::filter(fips == "51540") %>%
  ## epiweeks repeat every year, so pin the year too
  ## (assumes the weeks shown below are from 2020)
  dplyr::filter(epiyear == 2020, dplyr::between(epiweek, 33, 35)) %>%
  ## grouping columns (county, state) are kept in the output automatically
  dplyr::select(fips, epiweek, ideaths, cdeaths)

county	state	fips	epiweek	ideaths	cdeaths
Charlottesville	Virginia	51540	33	0	15
Charlottesville	Virginia	51540	33	0	15
Charlottesville	Virginia	51540	33	0	15
Charlottesville	Virginia	51540	33	0	15
Charlottesville	Virginia	51540	33	0	15
Charlottesville	Virginia	51540	33	0	15
Charlottesville	Virginia	51540	33	0	15
Charlottesville	Virginia	51540	34	0	15
Charlottesville	Virginia	51540	34	0	15
Charlottesville	Virginia	51540	34	0	15
Charlottesville	Virginia	51540	34	-1	14
Charlottesville	Virginia	51540	34	0	14
Charlottesville	Virginia	51540	34	0	14
Charlottesville	Virginia	51540	34	0	14
Charlottesville	Virginia	51540	35	2	16
Charlottesville	Virginia	51540	35	0	16
Charlottesville	Virginia	51540	35	1	17
Charlottesville	Virginia	51540	35	0	17
Charlottesville	Virginia	51540	35	0	17
Charlottesville	Virginia	51540	35	0	17
Charlottesville	Virginia	51540	35	1	18
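
To see what the downward revision does once the daily rows are rolled up, here is a minimal base R sketch that sums the `ideaths` values from the table above by epiweek. The `daily` and `weekly` names are only for illustration, and the values are typed in from the output above rather than pulled from JHU again.

```r
## daily incident deaths for fips 51540, copied from the table above
daily <- data.frame(
  epiweek = rep(33:35, each = 7),
  ideaths = c(0, 0, 0,  0, 0, 0, 0,
              0, 0, 0, -1, 0, 0, 0,
              2, 0, 1,  0, 0, 0, 1)
)

## sum the daily incident deaths within each epiweek
weekly <- aggregate(ideaths ~ epiweek, data = daily, FUN = sum)
weekly
```

With these values the total for epiweek 34 is -1, so the negative count carries through to the weekly scale.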

best way to handle this? probably in the get_ functions: we can bound ideaths/icases at 0 there. but i'm not sure we should change cdeaths/ccases, given that those *are* the reports. either way, it certainly doesn't make sense to count incident cases as negative!
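
A rough sketch of the bounding idea in base R, using the epiweek 34 cumulative values from the table above. This is not the code in the get_ functions; the variable names are made up for illustration, and `pmax()` is just one way to floor the incident counts at zero while leaving the cumulative counts as reported.

```r
## cumulative deaths for fips 51540 in epiweek 34, copied from the table above
cdeaths <- c(15, 15, 15, 14, 14, 14, 14)

## difference the cumulative counts to get incident counts
## (the first day is differenced against the last day of epiweek 33, which was 15)
ideaths_raw <- diff(c(15, cdeaths))

## bound the incident counts at zero; cdeaths is left exactly as reported
ideaths <- pmax(ideaths_raw, 0)

data.frame(cdeaths, ideaths_raw, ideaths)
```

One consequence of bounding this way: the bounded incident counts no longer sum to the change in the cumulative count (0 vs. -1 for this week).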

stephenturner commented 3 years ago

I agree, solve it in get_deaths and bound to zero there. Hopefully the fluctuations and reporting corrections are very minor and won't have much impact. One caveat: once we get down to county-level granularity we're definitely going to run into the problem you noted elsewhere, that these time series models weren't designed for count data. https://otexts.com/fpp3/counts.html suggests this is no problem once the counts are far enough from zero, but they won't be at the county level. They might not be at the state level for deaths either, if/when things start to wind down again.

vpnagraj commented 3 years ago

good points. but to clarify, i'm not suggesting we use the TS models at the county level.

i ran into this while playing around with the trendeval package and count regression over here: https://github.com/signaturescience/focustools/commit/bebb1c23e4c5f7c686af04d9e79eaf8852f45d5c

vpnagraj commented 3 years ago

resolved via #38

signaturescience / focustools

county-level incident counts are negative #31