nextstrain / dengue

Nextstrain build for dengue virus
https://nextstrain.org/dengue
8 stars 10 forks source link

Add E gene trees #18

Closed j23414 closed 3 months ago

j23414 commented 7 months ago

Description of proposed changes

The goal of this PR is to add dengue E gene trees (e.g. dengue_denv1_E.json, ... dengue_denv4_E.json) in response to feedback that certain locations may only provide the E sequences. To maintain clarity, E gene specific rules that differ from the standard genome rules are placed in separate *_E.smk files. Shared rules are consolidated using wildcards for streamlined implementation where possible.

General steps to support E gene trees are outlined as follows:

  1. Use newreference.py from the RSV pipeline to generate reference files (gb, fasta) specific to the E gene
  2. Use Nextclade v3 to extract E gene sequences from the full dataset
  3. Add a filter to exclude sequences shorter than 1000nt, reducing the risk of misclassification
  4. Add E gene rules as separate *_E.smk files
  5. Consolidate and merge redundant rules using Snakemake wildcards

Related issue(s)

https://github.com/nextstrain/dengue/issues/17

Checklist

j23414 commented 7 months ago

Thanks for checking! The distinction lies in the fact that the E gene build excludes the "augur clades" command, resulting in the absence of a clades_{serotype}_E.json file in the E gene export rule.

Instead, we'd rely on "augur traits" and a metadata column for annotating types and subtypes.

corneliusroemer commented 7 months ago

The distinction lies in the fact that the E gene build excludes the "augur clades" command, resulting in the absence of a clades_{serotype}_E.json file in the E gene export rule.

You could either create an empty dummy file in augur clades when augur clades is called for E or make the input conditional in export. While I'm in general very much in favor of low complexity of snakemake workflows and don't mind duplication if it reduces obscurity, here it should be straightforward to pack into a little bit of Python.

Do you see what I mean? You can embed a little shell script in the clades rule. I do that all the time, e.g. here I use a little bash if then to make the rule do different things based on a parameter. You could adapt this to test for he value of the wildcard.

rule preprocess_clades:
    input:
        clades="builds/clades{clade_type}.tsv",
        outgroup="profiles/clades/{build_name}/outgroup.tsv",
    output:
        clades="builds/{build_name}/clades{clade_type}.tsv",
    wildcard_constraints:
        clade_type=".*",  # Snakemake wildcard default is ".+" which doesn't match empty strings
    params:
        strain_set=lambda w: config["strainSet"][w.build_name],
    shell:
        """
        cp {input.clades} {output.clades};
        cat <(echo) {input.outgroup} >> {output.clades};
        if [ {params.strain_set} = 21L ]; then
            for clade in 19A 19B 20A 20B 20C 20D 20E 20F 20G 20H 20I \
                20J 21A 21B 21C 21D 21E 21F 21G 21H 21I 21J 21K 21M \
                Alpha Beta Gamma Delta Epsilon Eta Theta Iota Kappa Lambda Mu;
            do
                sed -i "/$clade/d" {output.clades};
            done
        fi
        """
jameshadfield commented 7 months ago

Yeah I agree with Cornelius here.

or make the input conditional in export

As an example of this, you can provide a function instead of a list of node-data JSONs in export and then have the function generate the list of node data JSONs conditional on the wildcards. This has benefits when visualising the DAG, as the clades rule won't appear in the graph for the E gene target.

j23414 commented 7 months ago

Thanks @corneliusroemer and @jameshadfield! Explored various suggested approaches, outlined below:

  1. Adding a preprocess_clades rule: Noticed that not every subtype in clades_serotype.tsv has E gene defining mutations listed (e.g. DENV1/V, DENV2/AII). Therefore, achieving a comprehensive solution may involve specifically identifying the E gene mutations for each subtype. This could be a potential task for later exploration.
  2. Creating a dummy file in augur clades: Encountered a failure with "augur export" on empty clades_{serotype}_E.json files which is expected from prior github discussion threads. However, I didn't dig too deeply into determining the minimal content required for the empty file, hoping for a more elegant solution through an alternative approach.
  3. Defining a node_data_files function similar to hepB: Ended up using this approach. Instead of placing the wildcard pattern in a separate function, opted to define the wildcard conditional inline.

Defining the wildcard conditional inline had the added benefit of consolidating down the augur translate rule. Thanks for the suggested solutions! This is open to further review and discussion.

j23414 commented 4 months ago

Placeholder of staged builds to help me visually check them, will update links to latest run soon.

genome E gene
all all/genome all/E
denv1 denv1/genome denv1/E
denv2 denv2/genome denv2/E
denv3 denv3/genome denv3/E
denv4 denv4/genome denv4/E
j23414 commented 3 months ago

This PR has been superseded by merged PRs:

Closing since no longer needed