nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

transform-strain-name: build strain name by concatenating fields #1515

Open joverlee521 opened 9 months ago

joverlee521 commented 9 months ago

Context

Following the naming pattern established for SARS-CoV-2 sequences, strain names usually take the form <country>/<sample_id>/<year>. All three fields are typically available in the metadata, so we can concatenate them to "build" a reasonable strain name.

Description

We could extend the existing augur curate transform-strain-name command to accept a list of input columns whose values are concatenated with a user-provided separator.
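
As a rough illustration of the proposed behavior (not augur's actual implementation), the concatenation could look something like the Python sketch below. The function name build_strain_name and its parameters are hypothetical, and the record values are made up; the field names follow the pattern described under Context.

```python
def build_strain_name(record, fields, separator="/"):
    """Join the values of `fields` from `record` using `separator`.

    Hypothetical helper for illustration only; not part of augur.
    Raises ValueError when a field is missing or empty, on the assumption
    that a partial strain name is not useful.
    """
    values = []
    for field in fields:
        # Treat a missing key and an empty/whitespace-only value the same way.
        value = str(record.get(field) or "").strip()
        if not value:
            raise ValueError(f"Missing value for field {field!r}")
        values.append(value)
    return separator.join(values)


# Made-up record for demonstration.
record = {"country": "USA", "sample_id": "WA1", "year": "2020"}
strain = build_strain_name(record, ["country", "sample_id", "year"])
print(strain)
```

Whether missing fields should raise an error, be skipped, or fall back to a placeholder is an open design question; the sketch simply picks the strictest option.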

Examples

joverlee521 commented 9 months ago

I briefly explored whether I could recreate Cornelius' script with csvtk mutate2, but ran into an error:

$ csvtk -t mutate2 -e ' $country + "/" + $accession + "/" + $date ' -n strain_display -s monkeypox-metadata.tsv 
[ERRO] Cannot transition token types from MODIFIER [+] to TIME [2007-10-30 00:00:00 -0700 PDT]

Edit: csvtk also converts dates to floats. This behavior will not change until the underlying evaluation package is updated.
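
One possible way to sidestep this type coercion is to treat every column as plain text. The sketch below uses Python's standard csv module, which reads all values as strings, so a date is never parsed or converted. The column names mirror the csvtk command above; the function name add_strain_display and the sample row are made up for illustration.

```python
import csv
import io


def add_strain_display(tsv_in, tsv_out, fields, new_column="strain_display", separator="/"):
    """Copy a TSV stream, appending a column built by joining `fields`.

    Hypothetical helper for illustration only. csv.DictReader yields every
    value as a string, so dates pass through unchanged.
    """
    reader = csv.DictReader(tsv_in, delimiter="\t")
    writer = csv.DictWriter(
        tsv_out,
        fieldnames=[*reader.fieldnames, new_column],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        row[new_column] = separator.join(row[field] for field in fields)
        writer.writerow(row)


# Made-up sample row; real metadata would be read from a file instead.
sample = "accession\tcountry\tdate\nABC123\tCountryA\t2007-10-30\n"
out = io.StringIO()
add_strain_display(io.StringIO(sample), out, ["country", "accession", "date"])
result = out.getvalue()
print(result)
```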