globaldothealth / adtl

Another data transformation language
https://adtl.readthedocs.io
MIT License

Add method for reading in long format data #63

Closed pipliggins closed 1 year ago

pipliggins commented 1 year ago

Italian CORE data is in long, rather than wide, form. We will need a new table kind for this. Data format: ID | Form section | Day number | Field name | Value

abhidg commented 1 year ago

This can be done with the current adtl functionality. Here is a minimal working example:

long.csv:

case_id,field,value
1,sex,1
1,age,20
2,sex,2
2,age,25

With the following TOML file long.toml:

[adtl]
  name = "long"
  description = "Convert a long table to wide"

  [adtl.tables]
    cases = { kind = "groupBy", groupBy = "id", aggregation = "lastNotNull" }

  [adtl.defs."Y/N/NK".values]
    1 = true
    2 = false

[cases]
  pathogen = "COVID-19"

  [cases.id]
    field = "case_id"

  [cases.sex_at_birth]
    field = "value"
    description = "Sex at Birth"
    values = { 1 = "male", 2 = "female", 3 = "non_binary" }
    if = { field = "sex" }

  [cases.age]
    field = "value"
    description = "Age"
    if = { field = "age" }

Running adtl long.toml long.csv should produce long-cases.csv:

age,id,pathogen,sex_at_birth
20,1,COVID-19,male
25,2,COVID-19,female
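
To make the grouping step concrete, here is a minimal Python sketch of what the example above relies on: each long row is mapped to a partial record, and records sharing an id are merged so that the last non-null value wins. This is an illustration of the idea only, not adtl's actual implementation; the names map_row and group_last_not_null are hypothetical.

```python
import csv
import io

# Same data as long.csv in the example above.
LONG_CSV = """case_id,field,value
1,sex,1
1,age,20
2,sex,2
2,age,25
"""

SEX_VALUES = {"1": "male", "2": "female", "3": "non_binary"}


def map_row(row):
    """Map one long-format row to a partial wide record (None means no value).

    Hypothetical helper mirroring the `if` conditions in long.toml.
    """
    record = {"id": row["case_id"], "pathogen": "COVID-19",
              "sex_at_birth": None, "age": None}
    if row["field"] == "sex":
        record["sex_at_birth"] = SEX_VALUES.get(row["value"])
    elif row["field"] == "age":
        record["age"] = row["value"]
    return record


def group_last_not_null(records, key="id"):
    """Merge records sharing `key`; a later non-null value replaces an earlier one."""
    merged = {}
    for record in records:
        target = merged.setdefault(record[key], {})
        for name, value in record.items():
            # Nulls never overwrite an existing entry.
            if value is not None or name not in target:
                target[name] = value
    return list(merged.values())


rows = csv.DictReader(io.StringIO(LONG_CSV))
cases = group_last_not_null(map_row(r) for r in rows)
for case in cases:
    print(case)
```

In this sketch the values stay as strings because they come straight from the CSV reader; any type conversion is left out.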

pipliggins commented 1 year ago

Ah okay, missed this! I'll have a go with the Italian CORE data.

pipliggins commented 1 year ago

There's an issue here: combinedType data gets overwritten as the rows are iterated through. I haven't quite pinned down what's happening, but ethnicity, for example, always ends up as [None], even though the data is present and is found initially.

pipliggins commented 1 year ago

Also, we're going to have to revisit linking different rows according to a field. The date for an observation will be on a different row from the observation itself, so it can't be found in a single pass. The rows are linked by the 'PROGRESSIVE_DAILY' column.
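
One way to picture the row-linking problem is a two-pass approach: first index the date for each (case, day) pair, then attach it to the observations from the same day. The sketch below is only an illustration under assumptions. The row layout, the field names "date" and "temperature", and the functions index_dates and attach_dates are all hypothetical; only the PROGRESSIVE_DAILY column name comes from this thread.

```python
# Hypothetical long-format rows: (case_id, PROGRESSIVE_DAILY, field, value).
ROWS = [
    ("1", "1", "date", "2020-03-01"),
    ("1", "1", "temperature", "38.2"),
    ("1", "2", "temperature", "37.5"),  # the date for day 2 appears later
    ("1", "2", "date", "2020-03-02"),
]


def index_dates(rows, date_field="date"):
    """First pass: record the date for every (case_id, day) pair."""
    dates = {}
    for case_id, day, field, value in rows:
        if field == date_field:
            dates[(case_id, day)] = value
    return dates


def attach_dates(rows, dates, date_field="date"):
    """Second pass: emit each observation with the date of its day, if known."""
    observations = []
    for case_id, day, field, value in rows:
        if field == date_field:
            continue  # date rows are lookups, not observations
        observations.append({
            "case_id": case_id,
            "name": field,
            "value": value,
            "date": dates.get((case_id, day)),  # None if no date row exists
        })
    return observations


observations = attach_dates(ROWS, index_dates(ROWS))
for obs in observations:
    print(obs)
```

The second row for day 2 shows why a single pass is not enough: the temperature is read before its date has been seen.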

pipliggins commented 1 year ago

As long-format data only comes up once, we'll transform the data rather than add a new table type.
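
A pre-processing step of this kind could look roughly like the sketch below, which pivots the five-column layout from the first comment into one wide row per (ID, form section, day). It is a minimal sketch rather than the script that was actually used. The header names id, section, day, field and value, the sample rows, and the functions long_to_wide and write_wide are assumptions made for illustration.

```python
import csv
import io

# Hypothetical header names for the five-column layout described in this issue.
LONG_CSV = """id,section,day,field,value
1,demographics,0,sex,1
1,demographics,0,age,20
1,daily,1,temperature,38.2
2,demographics,0,sex,2
"""

KEY_COLUMNS = ["id", "section", "day"]


def long_to_wide(long_file):
    """Pivot long rows into one dict per (id, section, day)."""
    wide = {}
    for row in csv.DictReader(long_file):
        key = tuple(row[name] for name in KEY_COLUMNS)
        record = wide.setdefault(key, dict(zip(KEY_COLUMNS, key)))
        # A repeated field within the same key keeps the last value seen.
        record[row["field"]] = row["value"]
    return list(wide.values())


def write_wide(records, out_file):
    """Write records as CSV; the header is the union of all field names."""
    extra = sorted({name for r in records for name in r} - set(KEY_COLUMNS))
    writer = csv.DictWriter(out_file, fieldnames=KEY_COLUMNS + extra, restval="")
    writer.writeheader()
    writer.writerows(records)


records = long_to_wide(io.StringIO(LONG_CSV))
buffer = io.StringIO()
write_wide(records, buffer)
wide_csv = buffer.getvalue()
print(wide_csv)
```

The resulting wide file can then be handled with an ordinary adtl specification, since all values for a given day sit on the same row.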