theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

Output whole genome SNP matrix for Snippy_Streamline and Snippy_Tree workflows even when core_genome used #351

Closed jrotieno closed 4 months ago

jrotieno commented 4 months ago

This PR closes #339.

🗑️ This dev branch should NOT be deleted after merging to main.

:brain: Aim, Context and Functionality

Knowing pairwise whole-genome SNP (wgSNP) distances is critical for assessing epidemiological relationships, even when a core-genome tree is used. Currently, users have to run Snippy_Streamline_PHB or Snippy_Tree_PHB twice to get both a core-genome tree and a whole-genome SNP matrix, which is cumbersome. This PR outputs the wgSNP matrix regardless of whether a core-genome or whole-genome tree is made; when core_genome is used, the core-genome SNP (cgSNP) matrix is produced as well.
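For reviewers less familiar with what a pairwise SNP matrix contains, here is a minimal Python sketch. It is not the code used by the workflow tasks: the function names, the sample names, and the choice to ignore positions where either sequence is not A/C/G/T are all assumptions made for illustration.

```python
from itertools import combinations

# Assumption: only unambiguous bases count towards a SNP difference.
VALID_BASES = set("ACGT")


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ at A/C/G/T sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        1
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x in VALID_BASES and y in VALID_BASES and x != y
    )


def snp_matrix(alignment):
    """Build a symmetric pairwise SNP matrix as a dict of dicts."""
    names = list(alignment)
    matrix = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        d = snp_distance(alignment[a], alignment[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


# Hypothetical toy alignment; real inputs would be full-length alignments.
alignment = {
    "sample_A": "ACGTACGT",
    "sample_B": "ACGTACGA",
    "sample_C": "ACCTNCGA",
}
matrix = snp_matrix(alignment)
for name, row in matrix.items():
    print(name, *row.values(), sep="\t")
```

Run on a whole-genome alignment this gives wgSNP distances; run on a core-genome alignment it gives cgSNP distances, which is why the two matrices can differ for the same samples.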

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes

The Snippy_Streamline_PHB and Snippy_Tree_PHB workflows now produce a wgSNP matrix regardless of whether a core-genome or whole-genome tree is made (in addition to a cgSNP matrix if core_genome is used).

The following workflows gain the new optional input midpoint_root_tree. Their behaviour is unchanged unless the user sets it so that the tree is not midpoint rooted in the SNP matrix re-ordering task: Augur_PHB, kSNP3_PHB, Core_Gene_SNP_PHB, MashTree_FASTA_PHB.

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

:clipboard: Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: N/A

Databases or database versions changed: N/A

Data processing/commands changed: Yes

We have added an optional input, midpoint_root_tree, that determines whether the tree used in the SNP matrix re-ordering task is midpoint rooted. The default is true, i.e. the tree will be midpoint rooted.
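To illustrate the re-ordering step itself, the sketch below sorts the rows and columns of a SNP matrix into the order in which tips appear in a Newick tree. It is a simplified stand-in for the actual task, not a copy of it: the helper names are hypothetical, the Newick parsing assumes unquoted labels without spaces, and midpoint rooting is not implemented here. Since re-rooting generally changes the order in which tips are written out, the midpoint_root_tree setting would be expected to affect the row and column order of the matrix, not the distances.

```python
import re


def tip_order(newick):
    """Return leaf labels in the order they appear in a Newick string.

    Simplification: labels are unquoted and contain no whitespace.
    A leaf label is any name directly following '(' or ','.
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


def reorder_matrix(matrix, order):
    """Return matrix rows as lists, with rows and columns sorted by `order`."""
    missing = [name for name in order if name not in matrix]
    if missing:
        raise KeyError(f"tips missing from matrix: {missing}")
    return [[matrix[row][col] for col in order] for row in order]


# Hypothetical toy inputs.
matrix = {
    "sample_A": {"sample_A": 0, "sample_B": 1, "sample_C": 2},
    "sample_B": {"sample_A": 1, "sample_B": 0, "sample_C": 1},
    "sample_C": {"sample_A": 2, "sample_B": 1, "sample_C": 0},
}
tree = "(sample_C:0.3,(sample_A:0.1,sample_B:0.2):0.05);"

order = tip_order(tree)
rows = reorder_matrix(matrix, order)
for name, row in zip(order, rows):
    print(name, *row, sep="\t")
```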

File processing changed: Yes. Additional outputs as described above.

Compute resources changed: N/A

➡️ Inputs

Added optional input midpoint_root_tree

⬅️ Outputs

The old output snippy_snp_matrix has been replaced with two new outputs: snippy_wg_snp_matrix and snippy_cg_snp_matrix.

:test_tube: Testing

Test Dataset

Commandline Testing with MiniWDL or Cromwell (optional)

WDL Code passed check tests

Terra Testing

Snippy_Streamline_PHB with core_genome option set to true: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Doughty_Sandbox/job_history/a7295ae7-9980-4d82-97af-c550afbfcaef

Snippy_Streamline_PHB with core_genome option set to false: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Doughty_Sandbox/job_history/810ce538-aad6-41e5-a8c8-ee4b46d7de1c

Checks for changes in the kSNP3_PHB workflow: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Doughty_Sandbox/job_history/72d5ce6a-48d3-4d0a-84c3-857c56cb5454

Checks for changes in the MashTree_FASTA_PHB workflow: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Doughty_Sandbox/job_history/725b5847-7840-4617-8dd4-b8cb0c810d0d

Suggested Scenarios for Reviewer to Test

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

🗂️ Associated Documentation (to be completed by Theiagen developer)

sage-wright commented 4 months ago

Testing the following scenarios; trees are compared to validation trees: