globaldothealth / list

Repository for Global.health: a data science initiative to enable rapid sharing of trusted and open public health data to advance the response to infectious diseases.
MIT License

Declarative ingestion #2554

Open abhidg opened 2 years ago

abhidg commented 2 years ago

Is your feature request related to a problem? Please describe. Much of our current ingestion code is repetitive, particularly around basic location handling, age parsing, and date parsing. The ingestion framework is also closely tied to the format the dataserver expects, and code is repeated across the parsers that convert CSV or JSON data into this common format.

In the interest of DRY, we should move the common parsing code into its own library. While that alone is sufficient for DRY, this RFC goes further: it proposes formalizing the operations and fields needed for ingestion and specifying them as configuration.
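As a rough illustration of the kind of shared helpers such a library might hold, here is a minimal sketch of date and age parsing. The function names, the `MAX_AGE` cap, and the age-group string formats ("20 - 29 Years", "80+ Years") are assumptions made for this example, not existing code in the repository.

```python
import re
from datetime import date, datetime
from typing import Collection, Optional, Tuple

# Values treated as missing by default (hypothetical library default).
DEFAULT_MISSING = frozenset({"NA", "None", ""})
# Assumed upper bound for open-ended age groups such as "80+ Years".
MAX_AGE = 120


def parse_date(
    value: Optional[str],
    date_format: str,
    missing: Collection[str] = DEFAULT_MISSING,
) -> Optional[date]:
    """Parse a date string, returning None for missing values."""
    if value is None or value.strip() in missing:
        return None
    return datetime.strptime(value.strip(), date_format).date()


def parse_age_range(
    value: Optional[str],
    missing: Collection[str] = DEFAULT_MISSING,
) -> Optional[Tuple[int, int]]:
    """Parse an age or age group into an inclusive (start, end) tuple."""
    if value is None or value.strip() in missing:
        return None
    numbers = [int(n) for n in re.findall(r"\d+", value)]
    if not numbers:
        return None
    if len(numbers) == 1:
        # A single number is either an exact age or an open-ended group.
        upper = MAX_AGE if "+" in value else numbers[0]
        return (numbers[0], upper)
    return (numbers[0], numbers[1])
```

Each parser would then pass its own date format and list of missing values instead of reimplementing this logic.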

Describe the solution you'd like Other than some parser-specific location code, most of the parser code can be reformulated as configuration. This has the usual advantages and disadvantages of configuration files:

Describe alternatives you've considered Use a DSL in Python, similar to Spack syntax. This gives some of the benefits of the above solution, but tools written in languages other than Python would find it non-trivial to parse.

Additional context Example of how the US parser would look as a configuration file in TOML with most of the logic abstracted to the parsing library:

name = "USA"
# country: used as a hint for location, if no other
# granular location is specified, geocode to country level
# by using included country code based geocoding map
country = "US"
url = "http://foo.bar"
date_format = "%Y/%m/%d"
missing = ["Missing", "Unknown"]  # NA, None, '' are defined in library as default

[fields]
age = "age_group"
gender = "sex"
ethnicity = "race_ethnicity_combined"
# There's a space in the name of this column in the upstream data.
# This is defined as the earliest non-empty value of either the CDC report date, or
# specimen collection. https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data-with-Ge/n8mc-b4w4
# date_*: fields starting with date_ are assumed to be dates
# according to date_format
date_confirmed = "cdc_case_earliest_dt "
date_symptoms = "onset_dt"
comorbidities = "medcond_yn"
status = "current_status"
hospitalized = "hosp_yn"
icu = "icu_yn"
death = "death_yn"

[include]  # inclusion criteria for case in parser
status = "Laboratory-confirmed case"

[gender]
male = "Male"
female = "Female"
other = "Other"

[outcome]  # specify multiple outcomes using array [[outcomes]]
value = "Death"
when = { death = "Yes" }

[icuAdmission]
when = { icu = "Yes" }

[hospitalAdmission]
when = { hospitalized = "Yes" }

[preexistingConditions]
when = { comorbidities = "Yes" }
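
To make the intended semantics concrete, here is a minimal sketch of how a parsing library could interpret a configuration like the one above. It is one possible reading of the proposal, not an existing implementation: `load_config`, `convert_row`, `matches`, and `EVENT_TABLES` are hypothetical names, and it assumes that `[include]` and every `when` table require all listed fields to match.

```python
from datetime import datetime
from typing import Any, Dict, Optional

# Missing values the proposal says the library defines by default.
DEFAULT_MISSING = {"NA", "None", ""}
# Tables whose `when` clause decides whether they apply to a case.
EVENT_TABLES = ("outcome", "icuAdmission", "hospitalAdmission", "preexistingConditions")


def load_config(path: str) -> Dict[str, Any]:
    """Read a parser configuration file into a dictionary."""
    import tomllib  # in the standard library from Python 3.11

    with open(path, "rb") as handle:
        return tomllib.load(handle)


def matches(record: Dict[str, Any], criteria: Dict[str, Any]) -> bool:
    """True if every field in criteria has the given value in record."""
    return all(record.get(name) == value for name, value in criteria.items())


def convert_row(config: Dict[str, Any], row: Dict[str, str]) -> Optional[Dict[str, Any]]:
    """Convert one upstream row to a case record, or None if excluded."""
    missing = DEFAULT_MISSING | set(config.get("missing", []))

    # Rename upstream columns to canonical fields; missing values become None.
    record: Dict[str, Any] = {}
    for name, column in config["fields"].items():
        value = row.get(column)
        if value is not None:
            value = value.strip()
        record[name] = None if value is None or value in missing else value

    # Apply the inclusion criteria from [include].
    if not matches(record, config.get("include", {})):
        return None

    # Evaluate `when` clauses against the raw upstream values.
    events = {}
    for table in EVENT_TABLES:
        rule = config.get(table)
        if rule is not None and matches(record, rule.get("when", {})):
            events[table] = rule.get("value", True)

    # Fields starting with date_ are parsed according to date_format.
    for name in list(record):
        if name.startswith("date_") and record[name] is not None:
            record[name] = datetime.strptime(record[name], config["date_format"]).date()

    # [gender] maps canonical values to upstream values, so invert it.
    gender_lookup = {source: canonical for canonical, source in config.get("gender", {}).items()}
    if record.get("gender") is not None:
        record["gender"] = gender_lookup.get(record["gender"])

    record.update(events)
    record["country"] = config["country"]
    return record
```

With this approach, a new parser is mostly a new configuration file, and only source-specific location handling would still need code.
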

iamleeg commented 2 years ago

Looks good to me. It would also be useful to be able to include common stanzas from other files; for example, many fields have the same names when parsers read data in the same source language.
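
One way such includes could work is sketched below. The `include_files` key is hypothetical (chosen because `[include]` already means inclusion criteria in the proposal), and the merge rule that the including file overrides the included ones is an assumption. The sketch does not detect include cycles.

```python
from typing import Any, Callable, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Merge two configurations; values in override win, tables merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_includes(
    config: Dict[str, Any],
    load: Callable[[str], Dict[str, Any]],
    key: str = "include_files",
) -> Dict[str, Any]:
    """Expand included files in order, then apply the file's own settings on top."""
    merged: Dict[str, Any] = {}
    for path in config.get(key, []):
        merged = deep_merge(merged, resolve_includes(load(path), load, key))
    own = {name: value for name, value in config.items() if name != key}
    return deep_merge(merged, own)
```

Here `load` is any function that turns a path into a parsed configuration, so parsers that share a source language could keep their common `[fields]` and `[gender]` stanzas in one file and override only what differs.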