nextstrain / ncov

Nextstrain build for novel coronavirus SARS-CoV-2
https://nextstrain.org/ncov
MIT License
1.35k stars 403 forks source link

Remove redundancy in config file values #1144

Closed victorlin closed 1 week ago

victorlin commented 3 weeks ago

Options such as subsampling.<sample>.min_date must contain the option flag to augur filter, e.g.

min_date: --min-date 2020-01-01

This is redundant and requires knowledge about an extra set of syntax on top of config file syntax. It also causes confusion resulting in PRs such as #886.

victorlin commented 3 weeks ago

Based on https://github.com/nextstrain/ncov/pull/886#pullrequestreview-1165909150 this is a https://github.com/nextstrain/ncov/labels/wontfix:

The min and max date parameters in ncov's filter logic don't behave like you'd expect due to an implementation choice we made at the beginning of the pandemic that we were then unable to change without breaking everyone's workflows.

Couldn't we release another major version of the ncov workflow to make this breaking change? Maybe this has been considered before – of course we could, but doesn't mean we should. But then what is the point of versioning this workflow?

jameshadfield commented 3 weeks ago

There are quite a few of these throughout ncov, e.g.

https://github.com/nextstrain/ncov/blob/9e9f3d4e950ad48f9497183daff8b30286b9655e/workflow/snakemake_rules/main_workflow.smk#L166-L173

and I think we learned from this for other pathogen worflows. There seems to be a few possible choices here:

  1. Move every config value to (e.g.) subsampling.<sample>.min_date: <date> and deal with the fall-out of many peoples' breaking configs
  2. Leave it as it is.
  3. Leave it as it is and add some syntax checking, e.g. if not subsampling.<sample>.min_date.startswith("--min-date"): raise.... This could be done gradually if needed.
  4. Allow both, e.g. if not subsampling.<sample>.min_date.startswith("--min-date"): subsampling.<sample>.min_date=f"--min-date {...}"

I'd go with 3, but 2 also seems fine.

victorlin commented 3 weeks ago

@jameshadfield thanks for listing out all the options. Not sure that (3) is worth it. It seems that most users are able to figure it out, but it's an unnecessary point of confusion when starting out. I would either continue doing nothing (2) or plan a path to deprecation, which would be something like (4) with a deprecation warning then (1) in a future release.

Maybe this hinges on some bigger picture questions: How much should we care about

a. consistency within the workflow? b. aligning config with other workflows? c. backwards compatibility between workflow versions?

victorlin commented 1 week ago

Had a good discussion during dev chat. Consensus: