mitre / data-owner-tools

Tools for the Childhood Obesity Data Initiative (CODI) data owners and partners to use in record linkage
Apache License 2.0
5 stars 8 forks source link

Csv Translation #24

Closed jsrockhill closed 2 years ago

jsrockhill commented 2 years ago

extract_csv() in extract.py now triggers when a .csv config .json is specified from the command line with flag "--csv"(example config file can be found in testing-and-tuning/). extract_csv() works using a modified version of case_insensitive_lookup() from utils/data_reader.py which allows for custom csv mappings to be passed as the version parameter. A validation function is also included in utils/validate.py

radamson commented 2 years ago

@jsrockhill CI is failing style checks. Instructions for checking and auto-correcting style issues are here: https://github.com/mitre/data-owner-tools#formatting-and-linting

radamson commented 2 years ago

@jsrockhill https://github.com/mitre/data-owner-tools/pull/24/commits/5a17af3da4e6583dd2ffffed146fb3607e98f6f9 doesn't actually fix the default values issue. I reverted that change and replaced it with https://github.com/mitre/data-owner-tools/pull/24/commits/94c637f3bf050503f2635e3f18e40c7986efeacf

The issue is present when the column exists, but certain rows are missing a value for it. Here is the relevant code before changes (this also affected the initial incarnation of translation_lookup too, but the code looked a little different).

https://github.com/mitre/data-owner-tools/blob/5a17af3da4e6583dd2ffffed146fb3607e98f6f9/utils/data_reader.py#L113-L114

You can exercise this error by removing a "First Name" from one of the rows in the same csv, and then adding a first name default to the configuration.

The issue with this code is that the column exists and so the value retrieved will be an empty string, not None so the default value conditional is never reached. Just checking for None and a default value won't work as a result. Instead of having individual checks in extract.py for each field, we can check for defaults as part of the lookup function.

Can you look at https://github.com/mitre/data-owner-tools/pull/24/commits/94c637f3bf050503f2635e3f18e40c7986efeacf and see if it looks ok or if I missed anything?

radamson commented 2 years ago

So I reverted the headers in https://github.com/mitre/data-owner-tools/pull/24/commits/5b8cabdfbd5ce5f00522b4f706b87612b45dc08c.

The issue is that the schemas in example-schema is designed around the existing headers. I don't particularly like the existing headers since the csv table header will have household_ prefixed fields even when extracting data for individuals.

I suppose I could still change the validate and report headers anyways, but having those out of sync with what's in the csv also seems bad.

We could either: