signaturescience / focustools

Forecasting COVID-19 in the US
https://signaturescience.github.io/focustools/
GNU General Public License v3.0

`get_*` utility functions #3

Closed vpnagraj closed 3 years ago

vpnagraj commented 3 years ago

moving forward we should abstract our code for retrieving case, death, hospitalization (and other data types) to a series of utility functions

to start let's focus on get_cases() and get_deaths()

each function could have an arg for "source" (NYT, JHU, etc) and potentially also an argument for granularity

these get_* data functions would wrap other helpers (like epiweek translation) and return a prepped tibble (or list of tibbles if we wanted one for incident and cumulative cases/deaths) that we could then put into our modeling framework(s)

Also, from #1:

If later we're going to do analyses at the state and county level, we should ensure that we can get the same counts after summing over counties as we get by reading in the pre-summed data, regardless of whether we do this with NYT or some other data source.
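something like this as a starting sketch (helper names like download_counts() and add_epiweek() are placeholders, not actual code):

```r
## sketch only: wrap hypothetical download + epiweek helpers
get_cases <- function(source = c("nyt", "jhu"),
                      granularity = c("national", "state", "county")) {
  source <- match.arg(source)
  granularity <- match.arg(granularity)
  dat <- download_counts(source, granularity)  # placeholder download helper
  dat <- add_epiweek(dat)                      # placeholder epiweek translation
  ## return a prepped tibble with incident + cumulative counts
  dplyr::arrange(dat, epiyear, epiweek)
}

## the check from #1: county counts summed up should match pre-summed national data
national  <- get_cases(source = "jhu", granularity = "national")
by_county <- get_cases(source = "jhu", granularity = "county")
summed    <- dplyr::count(by_county, epiyear, epiweek, wt = icases, name = "icases")
all.equal(national$icases, summed$icases)  # should be TRUE
```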

vpnagraj commented 3 years ago

underway via https://github.com/signaturescience/focustools/commit/c19ac3e878d5cfaacabffd41377fe0809d50c7b7

vpnagraj commented 3 years ago

the get_cases() and get_deaths() functions are coming together ...

idea is that these will provide a standard interface for retrieving count data

as of https://github.com/signaturescience/focustools/commit/265de9938f454090c667d0b25adf672b3d831263 each function now accepts an argument for "source" and "granularity". currently the source must be either "nyt" or "jhu" and the granularity must be "national", "state", or "county" (although only "national" will work for source = "nyt" right now)
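so usage would look like this (only the national granularity works for source = "nyt" today, per the note above):

```r
library(focustools)
usa_cases    <- get_cases(source = "nyt", granularity = "national")
state_deaths <- get_deaths(source = "jhu", granularity = "state")
```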

more documentation coming soon

also still need to verify the results (both incident and cumulative counts) coming out of each of these functions.

stephenturner commented 3 years ago

Slick. I'll pull and test when I get a chance.

stephenturner commented 3 years ago

Something's not exactly lining up, at least with the nyt data. The util script I wrote doesn't give you functions, it just gives you data. Let's move it into scratch or get rid of it once we fix this, because I'd rather call a function than source a script to get data. Either way, the numbers aren't matching up.

suppressPackageStartupMessages(library(tidyverse))
source("utils/get-us-data.R")
source("utils/get_data.R")

# from my script
usa

usa2 <- get_cases(source = "nyt", granularity = "national")

ujoin <- inner_join(usa %>% select(epiweek, icases),
                    usa2 %>% select(epiweek, icases),
                    by="epiweek", suffix=c(".sdt", ".vpn"))

ujoin %>% ggplot(aes(icases.sdt, icases.vpn)) + geom_point()

[image: scatterplot comparing icases.sdt vs icases.vpn]

stephenturner commented 3 years ago

Lots of messages coming in trying to read jhu data and sum to national. The messages could be suppressed by being explicit about column types, but the last bit was new to me. Looks like this is described at https://tidyselect.r-lib.org/reference/faq-external-vector.html, with documentation on all_of at https://tidyselect.r-lib.org/reference/all_of.html.

> usa3 <- get_cases(source="jhu", granularity = "national")
Parsed with column specification:
cols(
  .default = col_double(),
  iso2 = col_character(),
  iso3 = col_character(),
  Admin2 = col_character(),
  Province_State = col_character(),
  Country_Region = col_character(),
  Combined_Key = col_character()
)
See spec(...) for full column specifications.
Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(ind)` instead of `ind` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
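For reference, I think being explicit would look something like this (a sketch, reusing the column spec echoed above):

```r
## pass the column spec readr guessed so it doesn't have to message about it
dat <- readr::read_csv(
  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv",
  col_types = readr::cols(
    .default       = readr::col_double(),
    iso2           = readr::col_character(),
    iso3           = readr::col_character(),
    Admin2         = readr::col_character(),
    Province_State = readr::col_character(),
    Country_Region = readr::col_character(),
    Combined_Key   = readr::col_character()
  )
)

## and wrapping the index vector in all_of() silences the tidyselect note
ind <- which(names(dat) == "1/22/20")
dat <- tidyr::gather(dat, date, count, tidyselect::all_of(ind:ncol(dat)))
```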

stephenturner commented 3 years ago

Digging into the comment two up: comparing JHU to NYT data using your get_data functions lines up much more nicely than the NYT data coming from my script.

usa3 <- get_cases(source="jhu", granularity = "national")

# jhu vs nyt using get_data functions
inner_join(usa3 %>% select(epiweek, icases.jhu=icases),
           usa2 %>% select(epiweek, icases.nyt=icases), 
           by="epiweek") %>% 
  ggplot(aes(icases.nyt, icases.jhu)) + geom_point()

# jhu vs nyt using get-us-data.R script
inner_join(usa3 %>% select(epiweek, icases.jhu=icases),
           usa %>% select(epiweek, icases.nyt=icases), 
           by="epiweek") %>% 
  ggplot(aes(icases.nyt, icases.jhu)) + geom_point()

From your functions 😄

[image: jhu vs nyt incident cases scatterplot, using the get_data functions]

From my script 😦

[image: jhu vs nyt incident cases scatterplot, using the get-us-data.R script]

So my guess is either I'm doing something wrong processing the NYT data, or you're doing something wrong with both. I trust my own quick hack less.

vpnagraj commented 3 years ago

hmmm.

re: the messages yeah we can/should clean that up by passing column names in read_ function ... i'll also take a closer look at that all_of(ind) message (i think that is stemming from some kind of nasty code i had to use to reshape the tibble https://github.com/signaturescience/focustools/blob/master/utils/get_data.R#L11-L15)

as for validating the counts ...

i was hoping we could find another source that reported the nyt or jhu data by week (and ideally by the same granularity we use in get_cases() and get_deaths())

one option is the C19FH weekly report for deaths:

https://covid19forecasthub.org/reports/2020-12-15-weekly-report.html

i need to look at this a lot more closely but i don't think we're far off? at least with get_deaths()?

jhu_deaths_national <- get_deaths(source = "jhu", granularity = "national")
jhu_deaths_national %>%
  tail() 
epiyear epiweek ideaths cdeaths
2020 45 6971 238193
2020 46 7609 245802
2020 47 10262 256064
2020 48 10095 266159
2020 49 15121 281280
2020 50 16562 297842

vpnagraj commented 3 years ago

i think get_cases() and get_deaths() are working.

one outstanding feature is the "cache" option, which would allow us to precompute case/death data and conditionally load that instead of going out to the API.

i'm going to leave this open but mark the corresponding "Initial Method and Dev" task as done ... i think i can add this to another (like "Pipeline Automation") later
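a rough sketch of how the cache option could work (the argument name and the internal object are hypothetical for now):

```r
get_cases <- function(source = "nyt", granularity = "national", cache = FALSE) {
  if (cache) {
    ## hypothetical internal object precomputed by generate_sysdata.R
    return(cached_cases[[source]][[granularity]])
  }
  ## ... otherwise pull fresh data from the source as usual ...
}
```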

stephenturner commented 3 years ago

Sounds good. I didn't post this but the other day I took a look at i/c cases/deaths scatterplots from nyt vs jhu. they're very close. For now I've been using NYT just because it's much faster to pull down, being already summarized, but the cache option might help when drilling down further later on.

stephenturner commented 3 years ago

When looking into #26 lots of warnings are thrown when trying to get state-level data (deaths and cases)

> usad <- get_deaths(source="jhu",  granularity = "state")

── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  .default = col_double(),
  iso2 = col_character(),
  iso3 = col_character(),
  Admin2 = col_character(),
  Province_State = col_character(),
  Country_Region = col_character(),
  Combined_Key = col_character()
)
ℹ Use `spec()` for the full column specifications.

There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
2: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid
3: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
4: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid
5: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
6: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid

vpnagraj commented 3 years ago

@stephenturner weird. i just ran the same and could not reproduce the warnings ...

i guess this is part of the issue with pulling data that is being updated directly from GH. and maybe even more motivation for the cache argument.

speaking of, before we add the cache feature (by pre-pulling data in generate_sysdata.R and saving as internal obj) we need to confirm the license for redistributing the JHU / NYT data sets
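assuming the licenses check out, generate_sysdata.R could be as simple as this (object names made up):

```r
## generate_sysdata.R (sketch): pre-pull data and store as internal package data
cached_cases  <- get_cases(source = "jhu", granularity = "national")
cached_deaths <- get_deaths(source = "jhu", granularity = "national")
usethis::use_data(cached_cases, cached_deaths, internal = TRUE, overwrite = TRUE)
```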

stephenturner commented 3 years ago

Weird... Here's the relevant bit of code that's giving me the issue. The last bit subsets to epiweek 53, showing the problem.

dat <- readr::read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")

ind <- which(names(dat) == "1/22/20")

dat <-
  dat %>%
  tidyr::gather(date, count, all_of(ind:ncol(dat))) %>%
  ## drop unnecessary columns
  dplyr::select(-iso2,-code3,-Country_Region) %>%
  dplyr::mutate(date = as.Date(date, format = "%m/%d/%y")) %>%
  dplyr::mutate(epiyear=MMWRweek::MMWRweek(date)$MMWRyear, .after=date) %>%
  dplyr::mutate(epiweek=MMWRweek::MMWRweek(date)$MMWRweek, .after=epiyear) %>%
  dplyr::rename(county = Admin2, fips = FIPS, state = Province_State) %>%
  dplyr::group_by(county, fips, state) %>%
  dplyr::arrange(date) %>%
  ## coerce from cumulative to incident cases
  ## hold onto count as "ccases" for cumulative cases
  dplyr::mutate(icases = count - dplyr::lag(count, default = 0L),
                ccases = count)

dat %>%
  ## within each state, year, week grouping
  dplyr::group_by(state,epiyear,epiweek) %>%
  ## sum up the number of cases for that week
  dplyr::summarise(icases = sum(icases, na.rm = TRUE), .groups = "drop") %>%
  ## ignore the current week because it will likely be incomplete ...
  dplyr::filter(epiweek != lubridate::week(Sys.Date())) %>%
  ## create a variable for date using epiyear and week
  dplyr::filter(epiweek==53 | epiweek==52) %>% 
  dplyr::mutate(date = as.Date(paste(epiyear, epiweek, 1, sep="-"), "%Y-%U-%u")) 
# A tibble: 116 x 5
   state          epiyear epiweek icases date      
   <chr>            <dbl>   <dbl>  <dbl> <date>    
 1 Alabama           2020      52  23554 2020-12-28
 2 Alabama           2020      53  26000 NA        
 3 Alaska            2020      52   1791 2020-12-28
 4 Alaska            2020      53   2322 NA        
 5 American Samoa    2020      52      0 2020-12-28
 6 American Samoa    2020      53      0 NA        
 7 Arizona           2020      52  44810 2020-12-28
 8 Arizona           2020      53  46109 NA        
 9 Arkansas          2020      52  13855 2020-12-28
10 Arkansas          2020      53  17473 NA        
# … with 106 more rows
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
2: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid
3: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.

Narrowing down a bit more...

> as.Date("2020-52-1", "%Y-%U-%u")
[1] "2020-12-28"

> as.Date("2020-53-1", "%Y-%U-%u")
[1] NA
Warning message:
In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid

I'm a bit surprised you did NOT get this warning with NAs thrown in there... as.Date is base!

R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] focustools_0.0.0.9000

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6      rstudioapi_0.11   magrittr_2.0.1    hms_0.5.3         tidyselect_1.1.0  R6_2.4.1          rlang_0.4.10     
 [8] fansi_0.4.1       dplyr_1.0.2       tools_4.0.0       utf8_1.1.4        cli_2.0.2         ellipsis_0.3.0    assertthat_0.2.1 
[15] tibble_3.0.4      lifecycle_0.2.0   crayon_1.3.4      purrr_0.3.4       readr_1.4.0       tidyr_1.1.2       vctrs_0.3.6      
[22] curl_4.3          glue_1.4.0        compiler_4.0.0    pillar_1.4.7      MMWRweek_0.1.3    generics_0.0.2    lubridate_1.7.9.2
[29] pkgconfig_2.0.3  

stephenturner commented 3 years ago

What happens when you run:

as.Date("2020-52-1", "%Y-%U-%u")
as.Date("2020-53-1", "%Y-%U-%u")

stephenturner commented 3 years ago

(FYI, I pushed a commit to the get-data-fixes branch to deal with a note above thrown by recent(?) tidyselect updates)

stephenturner commented 3 years ago

re distributing as rda: should be easy if we want to. But wouldn't that data need to be updated pretty frequently to be useful?

NYT (https://github.com/nytimes/covid-19-data/blob/master/LICENSE) looks like we're good (this is not a commercial use) so long as we credit NYT in the data documentation or readme or somewhere else.

JHU (https://github.com/CSSEGISandData/COVID-19) is cc-by.

stephenturner commented 3 years ago

See also https://github.com/signaturescience/focustools/issues/16#issuecomment-762353274: make the location column either "US" for national, or the state/county identifier otherwise

vpnagraj commented 3 years ago

cool some of this stuff (particularly https://github.com/signaturescience/focustools/issues/16#issuecomment-762353274) is related to state level forecast

i merged the get-data-fixes branch with a new branch where we can work on that: https://github.com/signaturescience/focustools/tree/state-level-ts

stephenturner commented 3 years ago

cool, pulled. Hey, did you ever figure out that issue with the warnings a bit upthread?

vpnagraj commented 3 years ago

not yet.

but BTW hold on any pushes to state-level-ts for now ... lots more about to go up there

vpnagraj commented 3 years ago

@stephenturner thanks for adding the dplyr::all_of(ind:ncol(dat)) to suppress the tidyselect warning

for the as.Date issue with epiweek 53 ... it is weird that i'm not seeing a warning. the call just evaluates to NA for me locally:

as.Date(paste(2020, 52, 1, sep="-"), "%Y-%U-%u")
[1] "2020-12-28"
as.Date(paste(2020, 53, 1, sep="-"), "%Y-%U-%u")
[1] NA

anyways, it turns out we can avoid the dplyr::mutate(date = as.Date(paste(epiyear, epiweek, 1, sep="-"), "%Y-%U-%u")) step altogether
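one option (sketch only, not necessarily the exact edit) is MMWRweek::MMWRweek2Date(), which understands 53-week MMWR years:

```r
## strptime-based parsing fails for epiweek 53 ...
as.Date("2020-53-1", "%Y-%U-%u")
#> NA

## ... but MMWRweek2Date() handles it, since 2020 is a 53-week MMWR year
MMWRweek::MMWRweek2Date(MMWRyear = 2020, MMWRweek = 53, MMWRday = 1)
#> "2020-12-27"
```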

pushing these edits up to https://github.com/signaturescience/focustools/tree/state-level-ts

and for the cache feature, yeah, it would require regular updating. and updating via generate_sysdata.R and package rebuild. not the best solution? maybe not worth it? my only thought was that it would ensure that if the JHU and/or NYT folks change the data format (or if they dropped it from GH altogether) at least we'd have some data. but i guess we'd have bigger fish to fry if that happened.

let's close this issue for now. the recent fixes will be operational when we merge the state-level-ts branch, which we can do regardless of whether or not we actually decide to submit state level forecasts. i've checked and the national forecasts are identical, so no harm in merging whenever we're ready.

and if we do move to release the package let's consider a more solid way to handle (at least document) the data provenance / limitations

stephenturner commented 3 years ago

Pushed a few minor edits, readme and pipeline work, keep it closed.

RE making package data, you know, if we wait long enough I'm optimistic that the COVID-19 data in the US won't be relevant to update any more. You know, maybe it'll approach zero one of these days. We could then freeze the data and make a data object at that point? Let's re-open this or create another issue a few months down the road (don't shoot my optimism today please)

stephenturner commented 3 years ago

@vpnagraj what do you think about reopening this and having these functions return a list at different levels of granularity, rather than reading once for US and reading again for state?
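Something like this, maybe (helper names hypothetical): download the raw county-level data once, then roll it up:

```r
get_cases_all <- function(source = "jhu") {
  raw <- download_counts(source)  # hypothetical: single county-level download
  list(
    national = rollup_counts(raw, "national"),  # hypothetical roll-up helper
    state    = rollup_counts(raw, "state"),
    county   = rollup_counts(raw, "county")
  )
}
```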

vpnagraj commented 3 years ago

i don't think this is a bad idea. but some of these issues are getting a little muddled. if we want to address it, let's create a new issue specifically for this task so we can track it better

also ... we've got a lot of other fish to fry ... if we do create a new issue for this one, it's going to be bottom of the pile