nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 128 forks source link

augur curate rename: Add option to drop fields/columns #1526

Open joverlee521 opened 1 month ago

joverlee521 commented 1 month ago

Context

Originally implemented in augur curate rename but dropped due to discussion on UI in https://github.com/nextstrain/augur/pull/1506#discussion_r1669411291.

There are many cases where metadata includes extra fields/columns that should be dropped. We currently do this with tsv-select in the pathogen-repo-guide but it would be nice to have this built into augur curate rename.

Possible solutions

  1. Use existing --field-map option to define "empty" renames ( e.g. --field-map X= drops field X) originally implemented in https://github.com/nextstrain/augur/commit/650bd56e907e8fd96ad07c1424d97d1152800ead
  2. Add an explicit --drop-fields option
  3. Any fields/columns not provided via --field-map get automatically dropped (this is based on the config file idea proposed in https://github.com/nextstrain/augur/issues/1475 where every column must be accounted for in the command).
jameshadfield commented 1 month ago

I implemented the --field-map X= syntax partly based on how git (used to) work when deleting remote branches, and I personally find it nice and concise. But git introduced --delete <branch> so perhaps they'd vote for --drop-fields; then again they described it as syntactic sugar so perhaps not 🤷

Any fields/columns not provided via --field-map get automatically dropped (this is based on the config file idea ...

At a high level I do want a curate pipeline to be able to detect if new columns are in the input data, but I don't think forcing us to list them all out within curate rename is the right place.

genehack commented 1 month ago

I implemented the --field-map X= syntax partly based on how git (used to) work when deleting remote branches, and I personally find it nice and concise. But git introduced --delete <branch> so perhaps they'd vote for --drop-fields

note: strong opinions, weakly held follow

To me, the difference between --field-map X= and --drop-fields X comes down to least-surprise and clear expression.

--drop-fields X is unambiguous: if somebody writes out that flag, they're expecting field X to go away.

Conversely, --field-map X= could be intentional, or it could be somebody meant to come along later and fill in a Y on the RHS of that = but they forgot (or even, they're generating the field map programmatically, and they've botched something, and they're expecting a field name on the RHS but got the empty string instead). I can imagine that troubleshooting something like that in the middle of a big pipeline is going to be fairly frustrating.

joverlee521 commented 1 month ago

+1 for the more explicit option --drop-fields

(I've always found the Git CLI to be confusing...)