Declarative ingestion - Githubissues

Is your feature request related to a problem? Please describe.

Much of our current ingestion code is repetitive, particularly around basic location handling, age parsing, and date parsing. The ingestion framework is also closely tied to the format the dataserver expects, and code is repeated across the parsers that convert CSV or JSON data into this common format.

In the interest of DRY, we should move the common parsing code into its own library. While that is sufficient for DRY, this RFC goes further: it proposes formalizing the operations and fields needed for ingestion and specifying them as configuration.

Describe the solution you'd like

Other than some parser-specific location code, most of the parser code can be reformulated as configuration. This has the usual advantages and disadvantages of configuration files:

Advantages: The configuration file is constrained, which gives us an easier way to ensure that parsers are correct and to specify invalid field inputs, such as age ranges, consistently across parsers. In addition, using a configuration file rather than a full-fledged DSL or Python gives other tools an easy way to query which fields are supported at the parsing level: look at which configurations have an age key to find which parsers, and therefore which sources, support that field.

By removing much of the complexity of writing a parser, it also becomes easier to write new parsers (copy over a config file, change the field names). The dataserver-specific bits can be abstracted away, which would make it easier to support both updated schemas on the backend and alternative storage formats (the parsing library could write directly to CSV on S3, or just to a file, to aid local testing). Optionally, we can move some of the database fields used for ADI into the parser configuration itself, and use the configuration file declaratively to enable or disable parsers.

Disadvantages: The usual disadvantage of configuration files is that they are not expressive enough. To mitigate this, we can support loading Python code from submodules for certain fields, particularly location.
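One way the escape hatch to Python code could work is to let a configuration value name a function as a `"module:function"` string and resolve it at load time. The sketch below is an assumption about how that might be done, not an agreed design: the `load_hook` helper and the `"module:function"` convention are hypothetical, and `string:capwords` from the standard library stands in for a parser-specific location function.

```python
import importlib
from typing import Callable


def load_hook(spec: str) -> Callable:
    """Resolve a "package.module:function" string to a callable."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:function', got {spec!r}")
    hook = getattr(importlib.import_module(module_name), attr)
    if not callable(hook):
        raise TypeError(f"{spec!r} is not callable")
    return hook


# Stand-in for a parser-specific location function; a real configuration
# would name a function in the parser's own submodule instead.
normalize = load_hook("string:capwords")
print(normalize("new york"))
```

Keeping the hook as a plain string means the configuration file stays readable by non-Python tools, which can still see that a field is supported even if they cannot run the hook.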

Describe alternatives you've considered

Use a DSL in Python, similar to Spack syntax. This gives some of the benefits of the above solution, but has the disadvantage that tools not written in Python would find it non-trivial to parse.

Additional context

Example of how the US parser would look as a TOML configuration file, with most of the logic abstracted into the parsing library:

name = "USA"
# country: used as a hint for location. If no more granular
# location is specified, geocode to country level using the
# included country-code-based geocoding map
country = "US"
url = "http://foo.bar"
date_format = "%Y/%m/%d"
missing = ["Missing", "Unknown"]  # NA, None, '' are defined in library as default

[fields]
age = "age_group"
gender = "sex"
ethnicity = "race_ethnicity_combined"
# date_*: fields starting with date_ are assumed to be dates
# in the format given by date_format.
# cdc_case_earliest_dt is defined as the earliest non-empty value of either the CDC report date or
# specimen collection. https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data-with-Ge/n8mc-b4w4
# The upstream column name has a trailing space, so the space in the value below is intentional.
date_confirmed = "cdc_case_earliest_dt "
date_symptoms = "onset_dt"
comorbidities = "medcond_yn"
status = "current_status"
hospitalized = "hosp_yn"
icu = "icu_yn"
death = "death_yn"

[include]  # inclusion criteria for case in parser
status = "Laboratory-confirmed case"

[gender]
male = "Male"
female = "Female"
other = "Other"

[outcome]  # specify multiple outcomes using array [[outcomes]]
value = "Death"
when = { death = "Yes" }

[icuAdmission]
when = { icu = "Yes" }

[hospitalAdmission]
when = { hospitalized = "Yes" }

[preexistingConditions]
when = { comorbidities = "Yes" }

globaldothealth / list

Declarative ingestion #2554