socrata / discuss

Discuss all the things!
https://dev.socrata.com
Apache License 2.0

[healthdata.gov/6hii-ae4f] How can I iteratively read Community Profile Report .xslx files for a range of dates into an R dataframe? #146

Open dbcrow opened 2 years ago

dbcrow commented 2 years ago

API Docs: https://dev.socrata.com/foundry/healthdata.gov/6hii-ae4f

Hi-

My goal is to read into R all the daily .xlsx files named "Community_Profile_Report_YYYYMMDD_Public.xlsx" (where YYYYMMDD is the date of the daily file). I'm trying to write R code that iterates through the filenames, reads in each file, selects certain columns from the "County" tab, and then performs some statistical analyses.
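For concreteness, here's how I'd generate the date-stamped filenames for a range of dates in base R (the date range below is just an example):

```r
# Build the expected daily filenames for an example range of dates (base R only)
dates <- seq(as.Date("2020-12-17"), as.Date("2020-12-20"), by = "day")
files <- sprintf("Community_Profile_Report_%s_Public.xlsx", format(dates, "%Y%m%d"))
# files[1] is "Community_Profile_Report_20201217_Public.xlsx"
```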

All the .xlsx files are listed on this webpage: https://healthdata.gov/Health/COVID-19-Community-Profile-Report/gqxm-d9w9.

Here's the R code I have for reading in the .json file with the metadata file updates:

library(RSocrata)

df <- read.socrata(
  "https://healthdata.gov/resource/6hii-ae4f.json",
  app_token = "myToken",
  email     = "myEmail",
  password  = "myPassword"
)

# Pull the raw JSON directly and grab the "metadata_published" field of the first record
df_raw <- RJSONIO::fromJSON("https://healthdata.gov/resource/6hii-ae4f.json")
df <- df_raw[[1]][["metadata_published"]]

The dataframe "df" contains a long jumble of fields including "assetId"s like this:

"assetId": "4a1aed7c-b983-4a34-9b0d-431b57ad6c55"

and "filename"s like this:

"filename": "Community_Profile_Report 20201219_Public.xlsx"

Here's what a few lines from the dataframe look like, an essentially unintelligible jumble to me (but not to you!):

[{"filename": "Community_Profile_Report 20201217_Public.xlsx", "assetId": "4de459d6-23d0-4691-8dd8-f2132ecebf47", "name": "Community_Profile_Report_20201217_Public.xlsx"},
 {"filename": "Community_Profile_Report 20201217_Public.pdf", "assetId": "58dc781b-d147-4d5a-8dde-d1feb4fa59da", "name": "Community_Profile_Report_20201217_Public.pdf"},
 {"filename": "Community_Profile_Report 20201218_Public.xlsx", "assetId": "a7d95595-8db6-4360-b87a-e8879f4c112e", "name": "Community_Profile_Report_20201218_Public.xlsx"},
 {"filename": "Community_Profile_Report 20201218_Public.pdf", "assetId": "4a1aed7c-b983-4a34-9b0d-431b57ad6c55", "name": "Community_Profile_Report_20201218_Public.pdf"},
 {"filename": "Community_Profile_Report 20201219_Public.xlsx", "assetId": "df6ad32a-2fe8-4878-ae3d-a9d972c85ea8", "name": "Community_Profile_Report_20201219_Public.xlsx"},
 {"filename": "Community_Profile_Report 20201219_Public.pdf", "assetId": "554644bc-8791-4085-8996-d432d31e2665", "name": "Community_Profile_Report_20201219_Public.pdf"},
 {"filename": "Community_Profile_Report 20201220_Public.xlsx", "assetId": "ca666bde-0e88-4753-a03c-e0f5132777a9", "name": "Community_Profile_Report_20201220_Public.xlsx"},
 {"filename": "Community_Profile_Report 20201220_Public.pdf", "assetId": "46070ae8-31b4-4ae6-9a9a-4f4647bf54ce"

The goal would be to extract all the "assetId"s and "filename"s and put them into a dataframe so that I could concatenate the assetIds and filenames to create a list of URLs that R could cycle through to read in the .xlsx files. I would then manipulate the columns to get a single flat file (in "stacked" or "long" form) with the county in one column, the date (nested within county) in a second column, and the values for different variables (e.g., COVID-19 cases for a given county-date) in the remaining columns.
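Here's a sketch of the extraction step I have in mind, in base R only. The two-entry string below stands in for the full `df_raw[[1]][["metadata_published"]]` string, and I'm assuming the download URL follows the same pattern as the single-file example further down:

```r
# Pull every "filename"/"assetId" pair out of the metadata string with regular
# expressions, then build one download URL per .xlsx file.
json_str <- '[{"filename": "Community_Profile_Report 20201217_Public.xlsx", "assetId": "4de459d6-23d0-4691-8dd8-f2132ecebf47"}, {"filename": "Community_Profile_Report 20201217_Public.pdf", "assetId": "58dc781b-d147-4d5a-8dde-d1feb4fa59da"}]'

# Extract all values of a given key, e.g. "filename", from the raw JSON text
extract <- function(txt, key) {
  hits <- regmatches(txt, gregexpr(sprintf('"%s":\\s*"[^"]*"', key), txt))[[1]]
  sub(sprintf('"%s":\\s*"([^"]*)"', key), "\\1", hits)
}

files_df <- data.frame(filename = extract(json_str, "filename"),
                       assetId  = extract(json_str, "assetId"))

# Keep only the .xlsx entries and build the download URLs
xlsx_df <- files_df[grepl("\\.xlsx$", files_df$filename), ]
xlsx_df$url <- sprintf(
  "https://healthdata.gov/api/views/gqxm-d9w9/files/%s?download=true&filename=%s",
  xlsx_df$assetId,
  utils::URLencode(xlsx_df$filename, reserved = TRUE)
)
```

(With the jsonlite package installed, `jsonlite::fromJSON(json_str)` would give the same two-column dataframe directly, without the regex helper.)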

Here's what the code looks like to read in an .xlsx file for a single date:

library(openxlsx)
library(dplyr)

url <- "https://healthdata.gov/api/views/gqxm-d9w9/files/3bf3ff9c-67d3-47af-aad3-799e21de6e3d?download=true&filename=Community%20Profile%20Report%2020210903.xlsx"

# Read the "County" tab (sheet 6), keep the columns of interest, drop the sub-header row
df2 <- read.xlsx(url, sheet = 6) %>% select(1, 2, 6, 8, 16, 76, 77) %>% .[-1, ]
colnames(df2) <- c("county", "FIPS", "state", "pop", "cases_last7", "vaccinated", "vacc_prop")

# Convert the numeric columns from character
cols.name <- c("FIPS", "pop", "cases_last7", "vaccinated", "vacc_prop")
df2[cols.name] <- sapply(df2[cols.name], as.numeric)

# Drop incomplete rows, compute cases per 100k, express vaccination as a percentage
df2 <- df2 %>%
  filter(complete.cases(.)) %>%
  mutate(cases_last7_100k = cases_last7 / pop * 100000) %>%
  mutate(vacc_prop = vacc_prop * 100) %>%
  filter(state != "PR" & vacc_prop < 100)
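And here's a sketch of the iteration itself, assuming a dataframe `xlsx_df` (a name I'm making up) with one row per daily .xlsx file and columns `filename` and `url`, built from the assetIds and filenames as described above. `openxlsx::read.xlsx` is only referenced inside the function, and the line that actually hits the network is left commented out:

```r
# Read one day's file: select columns, drop the sub-header row, tag with its date
read_one <- function(url, date) {
  d <- openxlsx::read.xlsx(url, sheet = 6)[, c(1, 2, 6, 8, 16, 76, 77)]
  d <- d[-1, ]
  names(d) <- c("county", "FIPS", "state", "pop",
                "cases_last7", "vaccinated", "vacc_prop")
  d$date <- date
  d
}

# Recover each file's date from its name (note the space before YYYYMMDD)
file_date <- function(fname) {
  as.Date(sub(".*[ _](\\d{8})_Public\\.xlsx$", "\\1", fname), "%Y%m%d")
}

# Stack all days into one long dataframe (commented out: requires the network):
# df_long <- do.call(rbind, Map(read_one, xlsx_df$url, file_date(xlsx_df$filename)))
```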

I'd greatly appreciate any suggestions.

Thanks, David