Closed — vpnagraj closed this issue 3 years ago.
the `get_cases()` and `get_deaths()` functions are coming together ... the idea is that these will provide a standard interface for retrieving count data
as of https://github.com/signaturescience/focustools/commit/265de9938f454090c667d0b25adf672b3d831263, each function now accepts arguments for "source" and "granularity". currently the source must be either "nyt" or "jhu", and the granularity must be "national", "state", or "county" (although only "national" will work for source = "nyt" right now)
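for illustration, calls against that interface would look something like this (object names here are just placeholders; assumes the package functions are loaded):

```r
## national case counts from the NYT source
usa_cases <- get_cases(source = "nyt", granularity = "national")

## state-level death counts from the JHU source
state_deaths <- get_deaths(source = "jhu", granularity = "state")
```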
more documentation coming soon
also still need to verify the results (both incident and cumulative counts) coming out of each of these functions.
Slick. I'll pull and test when I get a chance.
Something's not exactly lining up, at least with the NYT data. The util script I wrote doesn't give you functions, it just gives you data. Let's move this into scratch or get rid of it once we fix this, because I prefer calling a function rather than sourcing a script to get data. But either way, the numbers aren't matching up.
```r
suppressPackageStartupMessages(library(tidyverse))
source("utils/get-us-data.R")
source("utils/get_data.R")

# from my script
usa

usa2 <- get_cases(source = "nyt", granularity = "national")

ujoin <- inner_join(usa %>% select(epiweek, icases),
                    usa2 %>% select(epiweek, icases),
                    by = "epiweek", suffix = c(".sdt", ".vpn"))
ujoin %>% ggplot(aes(icases.sdt, icases.vpn)) + geom_point()
```
Lots of messages coming in trying to read the JHU data and sum to national. The messages could be suppressed by being explicit about column types, but the last bit was new to me. Looks like this is described at https://tidyselect.r-lib.org/reference/faq-external-vector.html, with documentation on `all_of` at https://tidyselect.r-lib.org/reference/all_of.html.
```
> usa3 <- get_cases(source = "jhu", granularity = "national")
Parsed with column specification:
cols(
  .default = col_double(),
  iso2 = col_character(),
  iso3 = col_character(),
  Admin2 = col_character(),
  Province_State = col_character(),
  Country_Region = col_character(),
  Combined_Key = col_character()
)
See spec(...) for full column specifications.
Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(ind)` instead of `ind` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
```
Digging into the comment two up: comparing JHU data to NYT data, both pulled with your `get_data` functions, lines up much more nicely than the NYT data coming from my script.
```r
usa3 <- get_cases(source = "jhu", granularity = "national")

# jhu vs nyt using get_data functions
inner_join(usa3 %>% select(epiweek, icases.jhu = icases),
           usa2 %>% select(epiweek, icases.nyt = icases),
           by = "epiweek") %>%
  ggplot(aes(icases.nyt, icases.jhu)) + geom_point()

# jhu vs nyt using get-us-data.R script
inner_join(usa3 %>% select(epiweek, icases.jhu = icases),
           usa %>% select(epiweek, icases.nyt = icases),
           by = "epiweek") %>%
  ggplot(aes(icases.nyt, icases.jhu)) + geom_point()
```
From your functions 😄
From my script 😦
So my guess: either I'm doing something wrong with processing the NYT data, or you're doing something wrong on both. I trust my own quick hack less.
hmmm.
re: the messages, yeah we can/should clean that up by passing column names in the `read_` function ... i'll also take a closer look at that `all_of(ind)` message (i think that is stemming from some kind of nasty code i had to use to reshape the tibble: https://github.com/signaturescience/focustools/blob/master/utils/get_data.R#L11-L15)
as for validating the counts ...
i was hoping we could find another source that reported the nyt or jhu data by week (and ideally at the same granularity we use in `get_cases()` and `get_deaths()`)
one option is the C19FH weekly report for deaths:
https://covid19forecasthub.org/reports/2020-12-15-weekly-report.html
i need to look at this a lot more closely but i don't think we're far off? at least with `get_deaths()`?
```r
jhu_deaths_national <- get_deaths(source = "jhu", granularity = "national")

jhu_deaths_national %>%
  tail()
```
| epiyear | epiweek | ideaths | cdeaths |
|---|---|---|---|
| 2020 | 45 | 6971 | 238193 |
| 2020 | 46 | 7609 | 245802 |
| 2020 | 47 | 10262 | 256064 |
| 2020 | 48 | 10095 | 266159 |
| 2020 | 49 | 15121 | 281280 |
| 2020 | 50 | 16562 | 297842 |
i think `get_cases()` and `get_deaths()` are working.
one outstanding feature is the "cache" option, which would allow us to precompute case/death data and conditionally load that instead of going out to the API.
i'm going to leave this open but mark the corresponding "Initial Method and Dev" task as done ... i think i can add this to another (like "Pipeline Automation") later
Sounds good. I didn't post this but the other day I took a look at i/c cases/deaths scatterplots from nyt vs jhu. they're very close. For now I've been using NYT just because it's much faster to pull down, being already summarized, but the cache option might help when drilling down further later on.
When looking into #26, lots of warnings are thrown when trying to get state-level data (deaths and cases):
```
> usad <- get_deaths(source = "jhu", granularity = "state")
── Column specification ──────────────────────────────────────────
cols(
  .default = col_double(),
  iso2 = col_character(),
  iso3 = col_character(),
  Admin2 = col_character(),
  Province_State = col_character(),
  Country_Region = col_character(),
  Combined_Key = col_character()
)
ℹ Use `spec()` for the full column specifications.
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
2: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid
3: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
4: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid
5: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
6: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid
```
@stephenturner weird. i just ran the same and could not reproduce the warnings ...
i guess this is part of the issue with pulling data that is being updated directly from GH. and maybe even more motivation for the `cache` argument.
speaking of, before we add the cache feature (by pre-pulling data in generate_sysdata.R and saving as internal obj) we need to confirm the license for redistributing the JHU / NYT data sets
Weird... Here's the relevant bit of code that's giving me the issue. The last bit subsets to epiweeks 52 and 53, showing the problem.
```r
dat <- readr::read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")

ind <- which(names(dat) == "1/22/20")

dat <-
  dat %>%
  tidyr::gather(date, count, all_of(ind:ncol(dat))) %>%
  ## drop unnecessary columns
  dplyr::select(-iso2, -code3, -Country_Region) %>%
  dplyr::mutate(date = as.Date(date, format = "%m/%d/%y")) %>%
  dplyr::mutate(epiyear = MMWRweek::MMWRweek(date)$MMWRyear, .after = date) %>%
  dplyr::mutate(epiweek = MMWRweek::MMWRweek(date)$MMWRweek, .after = epiyear) %>%
  dplyr::rename(county = Admin2, fips = FIPS, state = Province_State) %>%
  dplyr::group_by(county, fips, state) %>%
  dplyr::arrange(date) %>%
  ## coerce from cumulative to incident cases
  ## hold onto count as "ccases" for cumulative cases
  dplyr::mutate(icases = count - dplyr::lag(count, default = 0L),
                ccases = count)

dat %>%
  ## within each state, year, week grouping
  dplyr::group_by(state, epiyear, epiweek) %>%
  ## sum up the number of cases at that week
  dplyr::summarise(icases = sum(icases, na.rm = TRUE), .groups = "drop") %>%
  ## ignore the current week because it will likely be incomplete ...
  dplyr::filter(epiweek != lubridate::week(Sys.Date())) %>%
  ## subset to epiweeks 52 and 53 to show the problem
  dplyr::filter(epiweek == 53 | epiweek == 52) %>%
  ## create a variable for date using epiyear and epiweek
  dplyr::mutate(date = as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u"))
```
```
# A tibble: 116 x 5
   state          epiyear epiweek icases date
   <chr>            <dbl>   <dbl>  <dbl> <date>
 1 Alabama           2020      52  23554 2020-12-28
 2 Alabama           2020      53  26000 NA
 3 Alaska            2020      52   1791 2020-12-28
 4 Alaska            2020      53   2322 NA
 5 American Samoa    2020      52      0 2020-12-28
 6 American Samoa    2020      53      0 NA
 7 Arizona           2020      52  44810 2020-12-28
 8 Arizona           2020      53  46109 NA
 9 Arkansas          2020      52  13855 2020-12-28
10 Arkansas          2020      53  17473 NA
# … with 106 more rows
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
2: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid
3: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
```
Narrowing down a bit more...
> as.Date("2020-52-1", "%Y-%U-%u")
[1] "2020-12-28"
> as.Date("2020-53-1", "%Y-%U-%u")
[1] NA
Warning message:
In strptime(x, format, tz = "GMT") :
(0-based) yday 369 in year 2020 is invalid
I'm a bit surprised you did NOT get this warning and the NAs thrown in there... `as.Date` is base!
```
R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] focustools_0.0.0.9000

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6     rstudioapi_0.11  magrittr_2.0.1   hms_0.5.3        tidyselect_1.1.0 R6_2.4.1         rlang_0.4.10
 [8] fansi_0.4.1      dplyr_1.0.2      tools_4.0.0      utf8_1.1.4       cli_2.0.2        ellipsis_0.3.0   assertthat_0.2.1
[15] tibble_3.0.4     lifecycle_0.2.0  crayon_1.3.4     purrr_0.3.4      readr_1.4.0      tidyr_1.1.2      vctrs_0.3.6
[22] curl_4.3         glue_1.4.0       compiler_4.0.0   pillar_1.4.7     MMWRweek_0.1.3   generics_0.0.2   lubridate_1.7.9.2
[29] pkgconfig_2.0.3
```
What happens when you run:

```r
as.Date("2020-52-1", "%Y-%U-%u")
as.Date("2020-53-1", "%Y-%U-%u")
```
(FYI, I pushed a commit to the get-data-fixes branch to deal with a note above thrown by recent(?) tidyselect updates)
re: distributing as rda: should be easy if we want to. But wouldn't that data need to be updated pretty frequently to be useful?
NYT (https://github.com/nytimes/covid-19-data/blob/master/LICENSE): looks like we're good (this is not a commercial use) so long as we credit NYT in the data documentation, readme, or somewhere else.

JHU (https://github.com/CSSEGISandData/COVID-19) is CC BY.
See also https://github.com/signaturescience/focustools/issues/16#issuecomment-762353274: make the location column either "US" for national, or the state or county otherwise.
cool. some of this stuff (particularly https://github.com/signaturescience/focustools/issues/16#issuecomment-762353274) is related to the state-level forecasts
i merged the get-data-fixes branch with a new branch where we can work on that: https://github.com/signaturescience/focustools/tree/state-level-ts
cool, pulled. Hey, did you ever figure out that issue with the warnings upthread a bit?
not yet.
but BTW hold on any pushes to state-level-ts for now ... lots more about to go up there
@stephenturner thanks for adding the `dplyr::all_of(ind:ncol(dat))` to suppress the tidyselect warning
for the `as.Date` issue with epiweek 53 ... it is weird that i'm not seeing a warning. the call just evaluates to NA for me locally:

```
> as.Date(paste(2020, 52, 1, sep = "-"), "%Y-%U-%u")
[1] "2020-12-28"
> as.Date(paste(2020, 53, 1, sep = "-"), "%Y-%U-%u")
[1] NA
```
anyways, it turns out that we can get around using `dplyr::mutate(date = as.Date(paste(epiyear, epiweek, 1, sep="-"), "%Y-%U-%u"))` altogether
pushing these edits up to https://github.com/signaturescience/focustools/tree/state-level-ts
and for the cache feature, yeah, it would require regular updating via generate_sysdata.R and a package rebuild. not the best solution? maybe not worth it? my only thought was that it would ensure that if the JHU and/or NYT folks changed the data format (or dropped it from GH altogether) at least we'd have some data. but i guess we'd have bigger fish to fry if that happened.
let's close this issue for now. the recent fixes will be operational when we merge the state-level-ts branch, which we can do regardless of whether or not we actually decide to submit state level forecasts. i've checked and the national forecasts are identical, so no harm in merging whenever we're ready.
and if we do move to release the package let's consider a more solid way to handle (at least document) the data provenance / limitations
Pushed a few minor edits, readme and pipeline work, keep it closed.
RE making package data: you know, if we wait long enough I'm optimistic that the COVID-19 data in the US won't need updating any more. Maybe it'll approach zero one of these days. We could then freeze the data and make a data object at that point. Let's re-open this or create another issue a few months down the road (don't shoot my optimism today, please)
@vpnagraj what do you think about reopening this and having these functions return a list at different levels of granularity, rather than reading once for US and reading again for state?
i don't think this is a bad idea. but some of these issues are getting a little muddled. if we want to address this, let's create a new issue specifically for this task so we can better track it
also ... we've got a lot of other fish to fry ... if we do create a new issue for this one, it's going to be bottom of the pile
Also, from #1:

moving forward we should abstract our code for retrieving case, death, hospitalization (and other data types) into a series of utility functions

to start let's focus on `get_cases()` and `get_deaths()`

each function could have an arg for "source" (NYT, JHU, etc) and potentially also an argument for granularity

these `get_*` data functions would wrap other helpers (like epiweek translation) and return a prepped tibble (or list of tibbles if we wanted one for incident and cumulative cases/deaths) that we could then put into our modeling framework(s)
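to make the shape of that interface concrete, a minimal sketch (argument defaults and the body are assumptions here, not what necessarily ships in the package):

```r
## hypothetical skeleton for the get_* family; match.arg() enforces the
## allowed "source" and "granularity" values described above
get_cases <- function(source = c("nyt", "jhu"),
                      granularity = c("national", "state", "county")) {
  source <- match.arg(source)
  granularity <- match.arg(granularity)
  ## retrieval, epiweek translation, and incident/cumulative
  ## summarization would happen here, returning a prepped tibble
  message("would retrieve ", granularity, " case counts from ", source)
}
```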