Open trvrb opened 4 months ago
Voting for strategy 2. Reasoning:
In theory, allowing arbitrary build names under keys such as traits
/frequencies
/etc. means certain build names are "reserved" and not allowed. In strategy 2, this means a single name default
is not allowed. In strategy 3, this means many names min_date
/max_date
/etc. are not allowed.
Practically speaking, there may be little chance that a user would want to name a build using one of the reserved names. But it'd be easier to keep the exception list small by using default
as a self-documenting namespace for these settings.
I would also vote to keep strategy 2. I think having an explicit default
key is a nice indicator that the config param can be specified by per build.
Using the frequencies
configs as an example because it's a mix of per workflow and per build params.
Without the default
key, it's not clear which params can be specified per build (as highlighted in strategy 1):
frequencies:
min_date: 2020-01-01
max_date: 2020-05-10
pivot_interval: 4
pivot_interval: 1
pivot_interval_units: "weeks"
narrow_bandwidth: 0.041
proportion_wide: 0.0
With the default
key, I think it's easier to explain that the min_date
and max_date
are the only params that can be specified per build.
frequencies:
default:
min_date: 2020-01-01
max_date: 2020-05-10
pivot_interval: 4
pivot_interval: 1
pivot_interval_units: "weeks"
narrow_bandwidth: 0.041
proportion_wide: 0.0
Using the
frequencies
configs as an example because it's a mix of per workflow and per build params.
Good example! I would go one step further and make sure that all parameters can be made build-specific:
frequencies:
default:
min_date: 2020-01-01
max_date: 2020-05-10
pivot_interval: 4
pivot_interval: 1
pivot_interval_units: "weeks"
narrow_bandwidth: 0.041
proportion_wide: 0.0
# every other entry is a build
Or if there is good reason to prevent them from being set on the build level, clearly mark them as applying to all builds via a separate namespace:
frequencies:
default:
min_date: 2020-01-01
max_date: 2020-05-10
all-builds:
pivot_interval: 4
pivot_interval: 1
pivot_interval_units: "weeks"
narrow_bandwidth: 0.041
proportion_wide: 0.0
# every other entry is a build
2 – builds.yaml override with necessity of
default
This is essentially the solution I came to for avian-flu
. It felt a little cumbersome sometimes, but overall I think works well. I used the FALLBACK
key rather than default
, but the idea is the same, e.g.:
clock_rates:
FALLBACK: # anything not specified by a subtype/time combination
pb2: '' # falls back to FALLBACK, and the empty string means no
pb1: '' # supplied clock rate, i.e. infer the clock
...
h5nx:
2y:
pb2: [0.00287, *clock_std_dev]
pb1: [0.00267, *clock_std_dev]
...
Solution 3 is the same, it just defines default values and build names in the same namespace/dictionary. I'd be happy with consistent use of either (2) or (3).
The workflow will secretly just keep using the
pivot_interval
specified in paramaters.yaml.
We should really prevent this from happening. Schemas are a (the?) way to identify this as an invalid config file and error out early. I don't think we've written any schema files for Snakemake, but we make extensive use of them for augur/auspice JSONs.
Thanks for the feedback here @joverlee521, @victorlin and @jameshadfield. Based on your preferences I updated clade_recency
in https://github.com/nextstrain/ncov/pull/1132 to follow strategy 2 and use config["colors"]["default"]["clade_recency"]
. We should plan to update existing uses of strategy 3 to follow strategy 2 instead.
Schemas are a (the?) way to identify this as an invalid config file and error out early. I don't think we've written any schema files for Snakemake, but we make extensive use of them for augur/auspice JSONs.
We should absolutely be making greater use of schema-based validation for our workflow configs. The only example of us doing so that I'm aware of is ncov, which provides a very basic, non-comprehensive schema (better than nothing, but not great).
Context
Currently, the workflow is a bit of mess when it comes to how one can specify parameters. There seems to be 3 different approaches:
1 – parameters.yaml only
For the
frequencies
rule, the parameterpivot_interval
is retrieved viaconfig["frequencies"]["pivot_interval"]
. This makes it grab exactly what's inparameters.yaml
underfrequencies:pivot_interval
. If someone tries to specify a 4-week build-specificpivot_interval
viaThe workflow will secretly just keep using the
pivot_interval
specified inparamaters.yaml
.Most parameters in the workflow work like this.
2 – builds.yaml override with necessity of
default
For the
traits
rule, the parametersampling_bias_correction
is retrieved via_get_sampling_bias_correction_for_wildcards
which is defined asIe it first looks for a build-specific list of
traits:{build_name}:sampling_bias_correction
and if doesn't find it, it expectstraits:default:sampling_bias_correction
inbuilds.yaml
orparameters.yaml
. This is described in the docs asThis works, but requires
parameters.yaml
to look like:with an extra
default
key compared to other entries inparameters.yaml
.This strategy is only used for the
traits
rule.3 – builds.yaml override without
default
For the
frequencies
rule, the parametermin_date
is retrieved via_get_min_date_for_frequencies
which is defined asIe it starts with trying to grab build-specific
frequencies:{build_name}:min_date
frombuilds.yaml
. If this doesn't exist, it looks forfrequencies:min_date
inbuilds.yaml
orparameters.yaml
and if this doesn't exist it directly returns a sensible default.This strategy is only used for the
frequencies
rule.Description
I believe that we should replace the strategy 2 above used only in
traits
rule with a setup like 3 above. This is what I did when I realized we weren't collectingnarrow_bandwidth
properly. In PR https://github.com/nextstrain/ncov/pull/1130 I followed strategy 3 to specifynarrow_bandwidth
as:We have the issue that if we swap
_get_sampling_bias_correction_for_wildcards
to use strategy 3, we'll need to updateparameters.yaml
to readThis will break custom profiles that external users are running. We could provide backwards compatibility however by looking first for
traits:sampling_bias_correction
and then fortraits:defaults:sampling_bias_correction
.Does this seem reasonable?