Snakemake improvements - Githubissues

jameshadfield commented 1 month ago

This is a series of observations about the current situation of the workflow and the complexities we've recently introduced to it as we've adapted it to handle the ongoing cattle-flu outbreak. This includes complexities added in (currently unmerged) Segment builds for H5N1 cattle flu outbreak. The suggestions here are roughly ordered as each builds on previous ones.

Config files

We've probably reached the point where config files would be clarifying rather than confusing. As @joverlee521 notes:

the main Snakefile would be simplified a lot if the param functions were pulled out into config files. If the params were defined in config files, we would be able to remove a lot of the if w.subtype=='h5n1-cattle-outbreak' checks.

I'm imagining here we would have config files for:

One each for H5N1, H5Nx, H7N9, H9N2
h5n1-cattle-outbreak genome build
h5n1-cattle-outbreak segments built (maybe this could be combined with the genome config?)

Optional `time` wildcard.

The per-segment cattle-flu builds don't have this wildcard, which is why the filename produced by the pipeline doesn't match the desired URL. We can maintain the current ruse of using time=all-time and add a final rule to rename the files or somesuch, but the cleaner way would be to make the wildcard optional. In practice (I think, haven't prototyped) this would mean that the three values of time would be "", "_all-time", "_2y".

Bring genome build into canonical snakemake pipeline

With the above changes, merging these pipelines would be a nice simplification. While they use different parameters, all the rules from tree onwards are the same between the pipelines. I think this could be nicely organised if the input to align becomes a function and when w.subtype=h5n1-cattle-outbreak that function refers to outputs of a rules/h5n1-cattle-outbreak-genome.smk ruleset.

Directory structure within results & config

This is rather minor, but the files counts are starting to become hard to manage! I'd think splitting config/ into one directory per config file (as described above) would work nicely. Similarly for results/.

Data files include source in filename

Currently no matter the upstream source (local ingest, GISIAD data, NCBI data) we get the following data files:

data/metadata.tsv
data/{segment}/sequences.fasta
data/metadata_{subtype}.tsv
data/sequences_{subtype}_{segment}fasta

This means when switching between a GISAID build (H5Nx, H5N1, H7N9, H9N2) to a NCBI build (h5n1-cattle-outbreak) these files will be overwritten¹, and then when you switch back to a GISAID build the entire pipeline needs to rerun.

I think it'd be clarifying to have the per-subtype files (3,4) in results/{subtype}/{metadata.tsv,sequences.fasta}. The rule that produces these would itself have inputs parameterised by the upstream source. Some pseudocode to explain what I'm thinking:

config['s3_upstream'] = {
    'name': 'ncbi-h5n1',
    'sequences': 's3://nextstrain-data/files/workflows/avian-flu/h5n1/sequences.fasta.zst',
    'metadata': 's3://nextstrain-data/files/workflows/avian-flu/h5n1/metadata.tsv.zst',
}

rule fetch_from_s3:
    output:
        f"data/{config['s3_upstream']['name']}/metadata.tsv",
        f"data/{config['s3_upstream']['name']}/sequences.fasta"
    shell:
        ...

def upstream_metadata():
    if LOCAL_INGEST:
        return f"ingest/{INGEST_SOURCE}/results/metadata.tsv",
    return f"data/{config['s3_upstream']['name']}/metadata.tsv",

rule filter_metadata_by_subtype:
    input:
        metadata = upstream_metadata,
    output:
        metadata = "results/{subtype}/metadata.tsv",
    ...

¹ It looks like Snakemake is smart enough to detect when a invocation uses different config params (s3_src etc) than those which produced the existing files in data and regenerates them accordingly. This was a nice surprise to me!

joverlee521 commented 1 month ago

One each for H5N1, H5Nx, H7N9, H9N2

In principle I agree with this, but want to be explicit that is will be a significant change in how builds are run. Instead of the current single invocation of nexstrain build, this will require 4 separate ones. This isn't an issue if the builds were automated, but seems like they are still being manually run/maintained so keeping track of 4 separate jobs might be too onerous.

As an alternative, we could potentially follow seasonal-flu's example of one config for all 4 subtypes, but that makes the config file more complex...

the cleaner way would be to make the wildcard optional.

Unfortunately, it's not possible to have an empty wildcard in Snakemake Testing with minimal Snakefile:

rule all:
    input: expand("output{test_wildcard}.txt", test_wildcard=["_a", "_b", ""])

rule a:
    output: touch("output{test_wildcard}.txt")

$ nextstrain build snakemake/empty-wildcard/ --forceall
Building DAG of jobs...
MissingInputException in rule all in file /nextstrain/build/Snakefile, line 1:
Missing input files for rule all:
    affected files:
        output.txt

Since we are adding segment builds for the cattle-outbreak, I wonder if the cattle-outbreak builds should be nested under the canonical h5n1 builds. Then cattle-outbreak would be considered the "time".

jameshadfield commented 1 month ago

Unfortunately, it's not possible to have an empty wildcard in Snakemake

We can do it, but I haven't thought through all the ramifications. Add the following to your minimal Snakefile:

wildcard_constraints:
    test_wildcard=".*"

Instead of the current single invocation of nexstrain build, this will require 4 separate ones.

Oh! Yeah, that may be a dealbreaker

nextstrain / avian-flu

Snakemake improvements #70

Config files

Optional `time` wildcard.

Bring genome build into canonical snakemake pipeline

Directory structure within results & config

Data files include source in filename

nextstrain / avian-flu

Snakemake improvements #70

Config files

Optional time wildcard.

Bring genome build into canonical snakemake pipeline

Directory structure within results & config

Data files include source in filename

Optional `time` wildcard.