signaturescience / fiphde

Forecasting Influenza in Support of Public Health Decision Making
https://signaturescience.github.io/fiphde/
GNU General Public License v3.0

validate that get_hosp is working with changes to hdgov data source #186

Closed vpnagraj closed 11 months ago

vpnagraj commented 1 year ago

our data retrieval for flu hospitalization data was originally written to retrieve daily counts from the HHS Protect data reported via healthdata.gov API and convert to weekly incidence:

https://signaturescience.github.io/fiphde/reference/get_hdgov_hosp.html

https://signaturescience.github.io/fiphde/reference/prep_hdgov_hosp.html

the reporting requirements and cadence have changed.

we need to validate that our data retrieval is working as expected.

questions to answer:

@dwill023 i am assigning you to take a look at this. use the thread in this issue to communicate what you find / address any other questions that you have along the way.

dwill023 commented 1 year ago

It doesn't look like we have to update the get_hdgov_hosp function. The API has previous_day_admission_influenza_confirmed data ending on 07/15/2023 and we are grabbing that as our flu.admits column. We are getting the same number of rows as the API. The prep_hdgov_hosp function also looks to be performing as expected. It removed the incomplete week of 07/14/2023, as we have programmed it to start weekly aggregations on a Sunday and end on a Saturday. Since the API doesn't have data published for that Saturday, the function dropped that week.

So we are getting the latest data from https://healthdata.gov/api/views/g62h-syeh/rows.csv
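For reference, the Sunday-to-Saturday aggregation and the dropping of an incomplete final week described above can be sketched in base R. This is only an illustration under stated assumptions, not the actual prep_hdgov_hosp implementation: the helper name `weekly_sun_sat` and the toy daily counts are hypothetical, and the only column name borrowed from the package is flu.admits.

```r
# Hypothetical helper (not part of fiphde): sum daily counts into
# Sunday-to-Saturday weeks and record how many days each week contains.
weekly_sun_sat <- function(daily) {
  # as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday
  wday <- as.POSIXlt(daily$date)$wday
  # Date of the Saturday that closes each day's week
  week_end <- daily$date + (6 - wday)
  key <- format(week_end)
  admits <- tapply(daily$flu.admits, key, sum)
  n_days <- tapply(daily$flu.admits, key, length)
  data.frame(
    week_end = as.Date(names(admits)),
    flu.admits = as.vector(admits),
    n_days = as.vector(n_days)
  )
}

# Toy data (made up): one count per day from Sunday 2023-07-02 through
# Friday 2023-07-14, so the second week is missing its Saturday.
daily <- data.frame(
  date = seq(as.Date("2023-07-02"), as.Date("2023-07-14"), by = "day"),
  flu.admits = 1L
)

weekly <- weekly_sun_sat(daily)
# Keep only weeks with all 7 days reported
complete <- weekly[weekly$n_days == 7, ]
```

With this toy input, `weekly` has two rows and `complete` keeps only the week ending 2023-07-08, because the week ending 2023-07-15 has six days of data.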

dwill023 commented 1 year ago

The site also mentioned that after Monday June 12th, 2023, the dataset will only be updated once a week on Fridays.

vpnagraj commented 1 year ago

@dwill023 thanks for digging into this!

i'm also seeing that our get_hdgov_hosp() %>% prep_hdgov_hosp() pipeline still returns data. however, like you said ... that data now appears to be lagged by a week.

take a look at the reprex below. previously, running on a monday (2023-07-31, in the week ending 2023-08-05), we would have expected prepared data through the most recent complete week, i.e., the week ending saturday 2023-07-29. instead the latest week ends 2023-07-22, so we're seeing a 1 week gap, which lines up with the messaging from CDC regarding overall changes to HHS reporting.

i'm not sure of the best way to fix this at the moment. we will likely need to either 1) nowcast the most recent week and train models with the nowcasted data or 2) shift modeling back 1 week and forecast 5 weeks ahead to reach the 4 week horizon (from the forecast date).

leaving this issue open to help us prepare for the 2023-24 season.

library(fiphde)

Sys.Date()
#> [1] "2023-07-31"
h <- get_hdgov_hosp(limitcols = TRUE)
#> 66593 rows retrieved from:
#> https://healthdata.gov/api/views/g62h-syeh/rows.csv
max(h$date)
#> [1] "2023-07-21"
h_weekly <- prep_hdgov_hosp(h, remove_incomplete = FALSE, min_per_week = 0)
#> Summarizing to epiyear/epiweek
#> Trimming to 2020-10-18
#> Filtering to US+DC+States only
#> Removing states with < 0 flu.admits per week on average over the last month
#> Removed 0 states:
max(h_weekly$week_end)
#> [1] "2023-07-22"

Created on 2023-07-31 with reprex v2.0.2
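The 1 week gap and the "forecast 5 weeks ahead" option can be written out as date arithmetic. The sketch below is a rough illustration only: it uses the dates from the reprex above, the variable names are hypothetical, and it does not call any fiphde function or fit any model.

```r
# Dates taken from the reprex; variable names are illustrative only
forecast_date <- as.Date("2023-07-31")  # a Monday
last_week_end <- as.Date("2023-07-22")  # latest week_end returned

# Most recent Saturday on or before the forecast date
# (as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday)
days_since_sat <- (as.POSIXlt(forecast_date)$wday + 1) %% 7
expected_week_end <- forecast_date - days_since_sat

# Reporting lag in whole weeks
lag_weeks <- as.numeric(expected_week_end - last_week_end) / 7

# Option 2: extend the horizon by the lag so that the last 4 steps
# still cover the 4 weeks following the forecast date
target_horizons <- 4
horizons_needed <- target_horizons + lag_weeks

# Week endings we want forecasts for, and week endings reached by
# stepping forward from the last observed week
targets <- expected_week_end + 7 * seq_len(target_horizons)
steps <- last_week_end + 7 * seq_len(horizons_needed)
```

Here `lag_weeks` is 1 and `horizons_needed` is 5. The first step (week ending 2023-07-29) fills the unreported week, and the remaining four steps line up with `targets` (weeks ending 2023-08-05 through 2023-08-26).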

vpnagraj commented 11 months ago

this API (and our data retrieval function) is working as documented, albeit with the delays noted above.

closing this issue