nextstrain / ncov

Nextstrain build for novel coronavirus SARS-CoV-2
https://nextstrain.org/ncov
MIT License
1.35k stars 403 forks source link

Improve / standardize the way that we pass through default parameters and build-specific parameters #1131

Open trvrb opened 4 months ago

trvrb commented 4 months ago

Context

Currently, the workflow is a bit of mess when it comes to how one can specify parameters. There seems to be 3 different approaches:

1 – parameters.yaml only

For the frequencies rule, the parameter pivot_interval is retrieved via config["frequencies"]["pivot_interval"]. This makes it grab exactly what's in parameters.yaml under frequencies:pivot_interval. If someone tries to specify a 4-week build-specific pivot_interval via

frequencies:
  global_all-time:
    pivot_interval: 4

The workflow will secretly just keep using the pivot_interval specified in paramaters.yaml.

Most parameters in the workflow work like this.

2 – builds.yaml override with necessity of default

For the traits rule, the parameter sampling_bias_correction is retrieved via _get_sampling_bias_correction_for_wildcards which is defined as

def _get_sampling_bias_correction_for_wildcards(wildcards):
    if wildcards.build_name in config["traits"] and 'sampling_bias_correction' in config["traits"][wildcards.build_name]:
        return config["traits"][wildcards.build_name]["sampling_bias_correction"]
    else:
        return config["traits"]["default"]["sampling_bias_correction"]

Ie it first looks for a build-specific list of traits:{build_name}:sampling_bias_correction and if doesn't find it, it expects traits:default:sampling_bias_correction in builds.yaml or parameters.yaml. This is described in the docs as

   traits:
     default:
       sampling_bias_correction: 2.5
       columns: ["country"]
     washington:
       # Override default sampling bias correction for
       # "washington" build and continue to use default
       # trait columns.
       sampling_bias_correction: 5.0

This works, but requires parameters.yaml to look like:

traits:
  default:
    sampling_bias_correction: 2.5
    columns: ["country"]

with an extra default key compared to other entries in parameters.yaml.

This strategy is only used for the traits rule.

3 – builds.yaml override without default

For the frequencies rule, the parameter min_date is retrieved via _get_min_date_for_frequencies which is defined as

def _get_min_date_for_frequencies(wildcards):
    if wildcards.build_name in config["frequencies"] and "min_date" in config["frequencies"][wildcards.build_name]:
        return config["frequencies"][wildcards.build_name]["min_date"]
    elif "frequencies" in config and "min_date" in config["frequencies"]:
        return config["frequencies"]["min_date"]
    else:
        # If not explicitly specified, default to 1 year back from the present
        min_date_cutoff = datetime.date.today() - datetime.timedelta(weeks=52)
        return numeric_date(
            min_date_cutoff
        )

Ie it starts with trying to grab build-specific frequencies:{build_name}:min_date from builds.yaml. If this doesn't exist, it looks for frequencies:min_date in builds.yaml or parameters.yaml and if this doesn't exist it directly returns a sensible default.

This strategy is only used for the frequencies rule.

Description

I believe that we should replace the strategy 2 above used only in traits rule with a setup like 3 above. This is what I did when I realized we weren't collecting narrow_bandwidth properly. In PR https://github.com/nextstrain/ncov/pull/1130 I followed strategy 3 to specify narrow_bandwidth as:

def _get_narrow_bandwidth_for_wildcards(wildcards):
    # check if builds.yaml contains frequencies:{build_name}:narrow_bandwidth
    if wildcards.build_name in config["frequencies"] and 'narrow_bandwidth' in config["frequencies"][wildcards.build_name]:
        return config["frequencies"][wildcards.build_name]["narrow_bandwidth"]
    # check if parameters.yaml contains frequencies:narrow_bandwidth
    elif "frequencies" in config and "narrow_bandwidth" in config["frequencies"]:
        return config["frequencies"]["narrow_bandwidth"]
    # else return augur frequencies default value
    else:
        return 0.0833

We have the issue that if we swap _get_sampling_bias_correction_for_wildcards to use strategy 3, we'll need to update parameters.yaml to read

traits:
  sampling_bias_correction: 2.5
  columns: ["country"]

This will break custom profiles that external users are running. We could provide backwards compatibility however by looking first for traits:sampling_bias_correction and then for traits:defaults:sampling_bias_correction.

Does this seem reasonable?

victorlin commented 4 months ago

Voting for strategy 2. Reasoning:

In theory, allowing arbitrary build names under keys such as traits/frequencies/etc. means certain build names are "reserved" and not allowed. In strategy 2, this means a single name default is not allowed. In strategy 3, this means many names min_date/max_date/etc. are not allowed.

Practically speaking, there may be little chance that a user would want to name a build using one of the reserved names. But it'd be easier to keep the exception list small by using default as a self-documenting namespace for these settings.

joverlee521 commented 3 months ago

I would also vote to keep strategy 2. I think having an explicit default key is a nice indicator that the config param can be specified by per build.

Using the frequencies configs as an example because it's a mix of per workflow and per build params.

Without the default key, it's not clear which params can be specified per build (as highlighted in strategy 1):

frequencies:
    min_date: 2020-01-01
    max_date: 2020-05-10
    pivot_interval: 4
    pivot_interval: 1
    pivot_interval_units: "weeks"
    narrow_bandwidth: 0.041
    proportion_wide: 0.0

With the default key, I think it's easier to explain that the min_date and max_date are the only params that can be specified per build.

frequencies:
    default:
        min_date: 2020-01-01
        max_date: 2020-05-10
    pivot_interval: 4
    pivot_interval: 1
    pivot_interval_units: "weeks"
    narrow_bandwidth: 0.041
    proportion_wide: 0.0
victorlin commented 3 months ago

Using the frequencies configs as an example because it's a mix of per workflow and per build params.

Good example! I would go one step further and make sure that all parameters can be made build-specific:

frequencies:
    default:
        min_date: 2020-01-01
        max_date: 2020-05-10
        pivot_interval: 4
        pivot_interval: 1
        pivot_interval_units: "weeks"
        narrow_bandwidth: 0.041
        proportion_wide: 0.0
    # every other entry is a build

Or if there is good reason to prevent them from being set on the build level, clearly mark them as applying to all builds via a separate namespace:

frequencies:
    default:
        min_date: 2020-01-01
        max_date: 2020-05-10
    all-builds:
        pivot_interval: 4
        pivot_interval: 1
        pivot_interval_units: "weeks"
        narrow_bandwidth: 0.041
        proportion_wide: 0.0
    # every other entry is a build
jameshadfield commented 3 months ago

2 – builds.yaml override with necessity of default

This is essentially the solution I came to for avian-flu. It felt a little cumbersome sometimes, but overall I think works well. I used the FALLBACK key rather than default, but the idea is the same, e.g.:

  clock_rates:
    FALLBACK:            # anything not specified by a subtype/time combination
        pb2: ''          # falls back to FALLBACK, and the empty string means no
        pb1: ''          # supplied clock rate, i.e. infer the clock
        ...
    h5nx:
      2y:
        pb2: [0.00287, *clock_std_dev]
        pb1: [0.00267, *clock_std_dev]
        ...

Solution 3 is the same, it just defines default values and build names in the same namespace/dictionary. I'd be happy with consistent use of either (2) or (3).

The workflow will secretly just keep using the pivot_interval specified in paramaters.yaml.

We should really prevent this from happening. Schemas are a (the?) way to identify this as an invalid config file and error out early. I don't think we've written any schema files for Snakemake, but we make extensive use of them for augur/auspice JSONs.

trvrb commented 2 months ago

Thanks for the feedback here @joverlee521, @victorlin and @jameshadfield. Based on your preferences I updated clade_recency in https://github.com/nextstrain/ncov/pull/1132 to follow strategy 2 and use config["colors"]["default"]["clade_recency"]. We should plan to update existing uses of strategy 3 to follow strategy 2 instead.

tsibley commented 1 month ago

Schemas are a (the?) way to identify this as an invalid config file and error out early. I don't think we've written any schema files for Snakemake, but we make extensive use of them for augur/auspice JSONs.

We should absolutely be making greater use of schema-based validation for our workflow configs. The only example of us doing so that I'm aware of is ncov, which provides a very basic, non-comprehensive schema (better than nothing, but not great).