nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

augur curate I/O: verify NDJSON records have the same fields #1510

Closed joverlee521 closed 3 months ago

joverlee521 commented 3 months ago

Context

Originally discussed in https://github.com/nextstrain/augur/pull/1506#discussion_r1660441721

augur curate records can be output to a metadata TSV file, which uses the first record's fields as output columns.

https://github.com/nextstrain/augur/blob/f6ee377336ec3813468fe641fa7910c14e54ced3/augur/io/metadata.py#L467-L474

With that in mind, the centralized inputs parser should verify that all input records have the same fields. Subcommands can then operate under the assumption that every record has the same fields and make changes accordingly.
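A minimal sketch of what such a check could look like, assuming the parser reads NDJSON line by line and treats the first record's fields as the reference set. The function name `read_ndjson_records` and the error wording are hypothetical and not part of augur's actual API. Field order is deliberately ignored here; only the set of field names is compared.

```python
import json
from typing import Iterable, Iterator


def read_ndjson_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield NDJSON records, raising if a record's fields differ from the first record's.

    Hypothetical helper for illustration only; not augur's real parser.
    """
    expected_fields = None
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        fields = set(record)
        if expected_fields is None:
            # The first record defines the reference set of fields.
            expected_fields = fields
        elif fields != expected_fields:
            missing = sorted(expected_fields - fields)
            unexpected = sorted(fields - expected_fields)
            raise ValueError(
                f"Record on line {line_number} does not have the same fields as "
                f"the first record: missing {missing}, unexpected {unexpected}"
            )
        yield record


# Same fields in a different order pass the check.
consistent = [
    '{"strain": "A", "date": "2024-01-01"}',
    '{"date": "2024-01-02", "strain": "B"}',
]
records = list(read_ndjson_records(consistent))
```

Because the check runs lazily inside a generator, a mismatch is reported only when the offending record is reached, so records before it may already have been processed.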

jameshadfield commented 3 months ago

the centralized inputs parser should verify that all input records have the same fields

Probably sensible to do this for outputs too, to assert the curate subcommand's run() always returns a consistent set of fields. I haven't checked if csv.DictWriter orders each row based on the information previously passed to writeheader() or if it relies on the insertion order being the same for each row, but if it's the latter then we should assert that as well.
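On the csv.DictWriter question, a small standard-library experiment can show the default behavior: rows are ordered by the `fieldnames` given to the writer, not by each dict's insertion order. This sketch uses made-up field names and an in-memory buffer, and it has not been checked against how augur's own TSV writer configures `restval` or `extrasaction`.

```python
import csv
import io

# Illustrative field names only; not taken from augur.
fieldnames = ["strain", "date"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
writer.writeheader()

writer.writerow({"strain": "A", "date": "2024-01-01"})
# Different insertion order: still written in fieldnames order.
writer.writerow({"date": "2024-01-02", "strain": "B"})
# Missing field: filled with restval, which defaults to an empty string.
writer.writerow({"strain": "C"})
tsv_output = buffer.getvalue()

# Extra field: raises ValueError with the default extrasaction="raise".
try:
    writer.writerow({"strain": "D", "date": "2024-01-04", "country": "X"})
    extra_field_error = None
except ValueError as error:
    extra_field_error = str(error)
```

With the defaults, then, insertion order does not need to be asserted, but a missing field is silently written as an empty value, which is one reason an explicit same-fields check on outputs could still be useful.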

jameshadfield commented 3 months ago

Some more details about TSV writing via our generalised write_records_to_tsv: