nextstrain / pathogen-repo-guide

4 stars 1 forks source link

How should we standardize config schema for files? #26

Open joverlee521 opened 9 months ago

joverlee521 commented 9 months ago

Context

This issue was originally brought up in https://github.com/nextstrain/measles/pull/9#pullrequestreview-1819090281 and I wanted to document options and consensus on how we should standardize Nextstrain config schemas for files. This is specifically for discussing config files required for workflows such as reference.gb, exclude.txt, include.txt, etc.

Historically, these files were provided at the top of the Snakefile. The files are grouped together instead of within rule specific params because a single file may be used by multiple rules.

To make the files easily configurable, the ncov workflow moved the files to the config YAML under a top level files key. Other pathogen repos have taken similar paths of providing config files through the config YAML, but have varying schemas. mpox uses top level file name keys (i.e. drops the files key) and it also includes rule specific file name keys:

auspice_config: "config/mpxv/auspice_config.json"
include: "config/mpxv/include.txt"
reference: "config/reference.fasta"
...
filter:
    exclude: "config/exclude_accessions.txt"

rsv uses the top level files key and top level file name keys

exclude: "config/outliers.txt"
description: "config/description.md"
...
files:
    color_schemes: "config/colors.tsv"
    auspice_config: "config/auspice_config.json"

Open questions

  1. Is providing config files through the config YAML still the way forward?
  2. Do we want to standardize config file schemas across pathogens? Should this be decided per pathogen repo?
  3. If we do standardize config schemas, should it use the top level files key, top level file name keys, or rule specific file name keys? Are there other suggested schemas?
joverlee521 commented 9 months ago

(1) I'm slightly wary of Snakemake's different interpretations of relative paths, but this should be okay as long as config files are always used as inputs and we use hardcoded workdir directives. (2) I don't think we can have a completely standardized schema across repos, but we can at least standardize the schema for the purpose of providing config files. (3) Copying my previous comment: Grouping files together makes sense if expected to be used multiple times. I also see the other side of it being confusing to need to edit configs for the filter rule under both files and filter. It's also not be immediately obvious looking at the config what rules use which files (unless you have detailed docs like ncov).

joverlee521 commented 9 months ago

Copying over @jameshadfield 's relevant comment:

I have spent plenty of time following the strings to work out which rules use which files (mainly in ncov), and often finding a bunch of files no longer used by any rule. Docs are great, but from experience they inevitably fall out of sync. How about something like the following (or a variant thereof)?

files:
  exclude: &files_exclude 'config/exclude_accessions.txt'
filter:
  exclude: *files_exclude
tsibley commented 9 months ago

This is specifically for discussing config files required for workflows such as reference.gb, exclude.txt, include.txt, etc.

One quick point which I think can help with clarity: I'd call these input files or similar since config files already means the YAML/JSON files that are loaded into the workflow config variable.

  1. Is providing config files through the config YAML still the way forward?

… how else would they be provided? If we revert back to a rule files with a params block (e.g.), how would alternative files be specified in a workflow invocation?

I'm slightly wary of Snakemake's different interpretations of relative paths […]

That's very fair! If really needed, I think we could have utilities to standardize paths inside the workflow. But I think/hope it wouldn't be necessary with due care around inputs and workdir or similar.

  1. Do we want to standardize config file schemas across pathogens? Should this be decided per pathogen repo?

I don't think we can have a completely standardized schema across repos, but we can at least standardize the schema for the purpose of providing config files.

Agreed.

  1. If we do standardize config schemas, should it use the top level files key, top level file name keys, or rule specific file name keys? Are there other suggested schemas?

To me, files implies that it will contain all the input files used by the workflow, and that sounds both not currently true and impractical. I'd think to use top-level keys or rule-specific keys, with a preference for top-level keys. The preference for top-level keys is that what rule they're used in is ultimately an implementation detail. For example, the "exclude" file is to me conceptually connected most strongly to the input metadata and sequence files; it's a set of things to skip from them. If you provide vastly different metadata/sequences, then you'll likely need to provide different exclusions too.