nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

Metadata wrangling prior to merging with existing data #1475

Open jameshadfield opened 1 month ago

jameshadfield commented 1 month ago

What are we trying to solve?

We have data (metadata + sequences) which we want to run in a build by combining it with other source(s), most commonly Nextstrain's S3-provided inputs. This happens frequently within the group, although it may be encountered more on the science side than the production side. It happens routinely for external users. (See also this slack message)

Doesn't the ingest pipeline do this?

Yes, for certain use cases. But I don't think we expect (m)any users to write ingest pipelines; I see them as a framework for how the Nextstrain team separates concerns for production builds.

Ok, but you can just write a rule to use an ingest-like transform step to combine the data, right?

Totally. If you're comfortable in this ecosystem. Most users aren't.

How could this work in the simple situation where column names match?

We could supply multiple inputs¹ to augur filter and it would combine the data for us ahead of any filtering/subsampling steps. This step would encode the input source so that subsampling could be conditional on this, as needed.
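As a rough illustration of the combining step, here is a minimal sketch in plain Python. It is not augur's implementation: the function name, the `input_source` column name, and the "later source wins on duplicate strain names" policy are all assumptions made for the example.

```python
import csv
import io


def combine_metadata(sources, id_column="strain", source_column="input_source"):
    """Concatenate TSV metadata from several labelled sources.

    `sources` maps a label (e.g. "reference", "local") to TSV text.
    Each row is tagged with its label so later subsampling can depend on it.
    Assumed policy: when an id appears in more than one source, the later
    source replaces the earlier row.
    """
    combined = {}
    for label, tsv_text in sources.items():
        reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
        for row in reader:
            row[source_column] = label
            combined[row[id_column]] = row
    return list(combined.values())


# Hypothetical example data with matching column names
reference = "strain\tdate\tdivision\nA/1\t2020-01-01\tWashington\nA/2\t2020-02-01\tOregon\n"
local = "strain\tdate\tdivision\nA/2\t2020-02-02\tOregon\nB/9\t2021-05-05\tIdaho\n"

rows = combine_metadata({"reference": reference, "local": local})
for row in rows:
    print(row["strain"], row["date"], row["input_source"])
```

With the source recorded per row, a later subsampling step could, for instance, keep every "local" row and only sample from "reference".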

There's a bunch of prior discussion here, and I think discussion about this particular functionality is best directed to the following conversations:

What about the situation where my metadata TSV has different column names?

I propose a new command. For argument's sake (and to avoid bikeshedding) let's call this augur consume. It would take some kind of configuration file such as:

[rename-columns]
strain = "strain"       # no change to column name
admin2 = "division"     # rename column
patient_name = "DROP"

and in the simplest case you'd invoke it as augur consume --input <tsv> --output <tsv> --config <config>. You could then use this file as an input to augur filter (as above). Ok, nothing groundbreaking yet, but keep reading.
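To make the rename step concrete, here is a minimal sketch of what applying the [rename-columns] table could look like. The mapping is written as a Python dict mirroring the example config (how the file is parsed is left open), `rename_columns` is a hypothetical helper, and passing unlisted columns through unchanged is an assumption.

```python
DROP = "DROP"  # sentinel value used in the example config to remove a column


def rename_columns(rows, mapping):
    """Apply a rename/drop mapping to a list of row dicts.

    Assumption: columns not mentioned in the mapping are kept as-is.
    """
    renamed = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            target = mapping.get(name, name)
            if target == DROP:
                continue  # column is dropped entirely
            new_row[target] = value
        renamed.append(new_row)
    return renamed


# Mirrors the [rename-columns] table above
mapping = {"strain": "strain", "admin2": "division", "patient_name": "DROP"}

# Hypothetical input row
rows = [
    {"strain": "A/1", "admin2": "King County", "patient_name": "redacted", "date": "2020-01-01"},
]

renamed = rename_columns(rows, mapping)
print(renamed)
```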

What's TSV? My source-of-truth is Excel. The above augur curate would take Excel files as input. This would be the gateway to consume this data in a pipeline such that the source of truth stays as Excel and is read upon each workflow run.

Why stop at Excel? Don't. Add NDJSON, Google Sheets, etc. Our development choices can be driven by PH partnerships.

Column names are easy, what about some other formatting? Essentially every functionality present within augur curate + the vendored ingest scripts could be exposed via the above config file. Geographic fixes, date fixes, renaming etc. Both systematic fixes and one-offs.

How do I write this config file? Seems hard

There would be an interactive mode (Text User Interface mode) which would take your input data as well as reference data (e.g. our provisioned S3 data, augur default data) and guide you through the choices available. To stay with the column renaming example, we can use the reference data to know what columns are missing and therefore may be using a different name in your data; importantly we're here setting up a relationship between the input data and the reference data. This interactivity would read/write state to the config such that a subsequent run of the same input wouldn't need any user-interaction.
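One way the interactive mode might come up with its prompts is to compare the input's columns against the reference columns and offer close name matches as rename candidates. The sketch below uses Python's difflib for the fuzzy matching; the function name, the 0.6 cutoff, and the example column names are assumptions, and a real TUI would presumably use richer signals than name similarity alone.

```python
import difflib


def suggest_renames(input_columns, reference_columns, cutoff=0.6):
    """For each reference column missing from the input, list input columns
    with similar names that the user might want to rename to it.
    """
    missing = [c for c in reference_columns if c not in input_columns]
    unmatched = [c for c in input_columns if c not in reference_columns]
    suggestions = {}
    for ref in missing:
        # Up to three candidates per missing column, best match first
        suggestions[ref] = difflib.get_close_matches(ref, unmatched, n=3, cutoff=cutoff)
    return suggestions


# Hypothetical column names
input_columns = ["strain", "collection_date", "Division", "patient_name"]
reference_columns = ["strain", "date", "division", "country"]

suggestions = suggest_renames(input_columns, reference_columns)
for ref, candidates in suggestions.items():
    print(f"reference column {ref!r}: candidates {candidates}")
```

Whatever the user picks would then be written back to the config, so the same prompts are not needed on the next run.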

How does an interactive mode fit with a repeatable pipeline?

As the config could know metadata about the previous inputs it lends itself very well to reproducibility. If the data format is unchanged then things will "just work" each time. If the data format's changed then this step is a great place to stop and let the user know; a quick detour at this stage through the interactive mode can update the config to represent the new format.
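A minimal sketch of the "stop if the format changed" check, assuming the config records the column names seen on the previous run. The function names and the idea of storing just the column list are assumptions for illustration.

```python
def diff_columns(current_columns, recorded_columns):
    """Compare the current input's columns with those recorded in the config."""
    current, recorded = set(current_columns), set(recorded_columns)
    return {
        "added": sorted(current - recorded),
        "removed": sorted(recorded - current),
    }


def needs_interaction(diff):
    """True if the input format differs from what the config was built for."""
    return bool(diff["added"] or diff["removed"])


# Columns recorded in the config during a previous (hypothetical) run
recorded = ["strain", "admin2", "patient_name"]

unchanged = diff_columns(["strain", "admin2", "patient_name"], recorded)
changed = diff_columns(["strain", "county", "patient_name"], recorded)

print(needs_interaction(unchanged), unchanged)
print(needs_interaction(changed), changed)
```

In a workflow, a non-empty diff would be the point to halt with a message pointing the user back to the interactive mode.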

Would this help PH users?

I believe so. But we should design it via a collaborative approach to make sure.


¹ Obviously there are lots of details to work out about the specifics and the exact merging logic, but I imagine it to be similar to the approach I implemented for ncov ~3 years ago, which "was written in preparation for a future augur where commands may take multiple metadata files, thus making this script unnecessary"

joverlee521 commented 1 month ago

+1 for this epic user story! The proposed plan of helping users who are less familiar with command line + TSV wrangling is very ambitious and a worthwhile pursuit with PH users in mind 🙌

What about the situation where my metadata TSV has different column names? I propose a new command.

I had initially thought of supporting this through augur curate (transform-field-names is the prototype), but then I learned that csvtk rename / rename2 exist. rename2 is able to take a tab-delimited key-value file for replacing key with value. Would we want to just wrap csvtk here for a friendlier UI?

What's TSV? My source-of-truth is Excel.

Totally agree to add support for Excel through augur curate inputs. I briefly thought about it when first implementing the augur curate I/O interface, but clearly never got back to officially proposing it.

Why stop at Excel? Don't. Add NDJSON, Google Sheets, etc. Our development choices can be driven by PH partnerships.

If we push this into augur curate, we already support NDJSON. I think it'd be nice to have augur curate as the central command for supporting all the different types of metadata formats.

Column names are easy, what about some other formatting? Essentially every functionality present within augur curate + the vendored ingest scripts could be exposed via the above config file. Geographic fixes, date fixes, renaming etc. Both systematic fixes and one-offs.

Definitely hear this want/need. I wanted augur curate to be more built out before thinking of the best way to "wrap" subcommands with a high level config, but maybe the time is now.

How do I write this config file? Seems hard. There would be an interactive mode (Text User Interface mode) which would take your input data as well as reference data (e.g. our provisioned S3 data, augur default data) and guide you through the choices available.

To me, this is the crux of this issue! Would love to brainstorm on how this interactive mode would work.

tsibley commented 1 month ago

+1 for the write up and general ideas. James and I chatted a little on Tuesday about this topic. I think lowering barriers to adoption is a huge win and efforts to do so are worth undertaking.

Would we want to just wrap csvtk here for a friendlier UI?

Field renames are a straightforward enough operation that I'd think integration with augur curate + config + TUI will be easier if we don't shell out to an external command for it.

Would love to brainstorm on how this interactive mode would work.

In terms of what the UI can be like, the sky is the limit. But I'd say the bulk of the work is gonna be making the UI work really well instead of being clunky.