Snippy_Streamline: snp-dists does not seem to use core genome by default

sam-baird commented 1 year ago

Hello Theiagen Team,

I'm performing SNP and phylogenetic analysis for a set of bacterial samples on Terra using PHB v1.0.1. When I compared the core genome SNP matrix output from the kSNP3 workflow to the SNP matrix output from the Snippy_Streamline workflow, I was surprised to see that the pairwise SNP distances were significantly higher on average for Snippy_Streamline. I had assumed by default that a core genome alignment was used by snp-dists for Snippy_Streamline because the docs on Notion indicate that the default value for core_genome is true, and I had not explicitly set this attribute. When I reran the workflow with core_genome explicitly set to true, the SNP distances were much closer to the kSNP3 core genome SNP distances.

Looking at wf_snippy_tree.wdl, in the case of not explicitly setting core_genome, the snp-sites task is skipped:

https://github.com/theiagen/public_health_bioinformatics/blob/5a68417767bbb53f6b6a303e22c8092e4f8b4031/workflows/phylogenetics/wf_snippy_tree.wdl#L70-L74

Then the Gubbins polymorphic FASTA is used instead as the input to snp-dists (assuming use_gubbins is left at its default of true):

https://github.com/theiagen/public_health_bioinformatics/blob/5a68417767bbb53f6b6a303e22c8092e4f8b4031/workflows/phylogenetics/wf_snippy_tree.wdl#L104-L106

I'm not sure whether the Gubbins polymorphic FASTA is based on a pan genome alignment rather than a core genome alignment, but it looks like a pan genome alignment since there are gaps (-) in the alignment FASTA. It seems like the default behavior should be to use the core genome alignment from snp-sites.
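To illustrate why the two inputs can give different distances, here is a minimal Python sketch. It is not the snp-dists or snp-sites code. It assumes that a pairwise difference is counted only where both bases are A, C, G, or T, and that a "core" alignment keeps only columns where every sample has A, C, G, or T. The sample names and sequences are made up for illustration.

```python
from itertools import combinations

VALID = set("ACGT")

def snp_distance(a, b):
    """Count positions where both bases are ACGT and they differ."""
    return sum(1 for x, y in zip(a, b) if x in VALID and y in VALID and x != y)

def to_core(aln):
    """Keep only columns where every sequence has an unambiguous base."""
    names = list(aln)
    length = len(aln[names[0]])
    keep = [i for i in range(length) if all(aln[n][i] in VALID for n in names)]
    return {n: "".join(aln[n][i] for i in keep) for n in names}

def distance_matrix(aln):
    """Pairwise SNP distances keyed by (name_a, name_b)."""
    return {(p, q): snp_distance(aln[p], aln[q]) for p, q in combinations(aln, 2)}

# Hypothetical toy alignment: s3 has a gap in the second column.
aln = {"s1": "ACGTA", "s2": "ATGTT", "s3": "A-GTA"}

full = distance_matrix(aln)           # gapped column still counts for s1 vs s2
core = distance_matrix(to_core(aln))  # gapped column is dropped for every pair
```

In this toy case the s1/s2 distance drops once the gapped column is removed, because that column is excluded for all pairs and not only for the pair involving s3. That is one way a non-core alignment can yield higher pairwise distances.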

sage-wright commented 1 year ago

Hey Sam! Thanks for letting us know about this issue. We'll be taking a look in more detail on Monday, but I wanted to let you know we have seen this and are working on a resolution!

kapsakcj commented 1 year ago

Hey Sam, thanks for raising this issue. You are correct: in v1.0.1 (and prior versions) the core_genome boolean is unset by default (and so acts as false), while our docs say the opposite. We will correct this in the docs.
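As a rough illustration of "unset but acts as false", the selection described in this thread can be modelled in Python as below. This is a sketch of the behavior as reported here, not the WDL itself. The function name and the returned labels are hypothetical, and the branch where both options are off is not described in the thread, so it is only a placeholder.

```python
from typing import Optional

def snp_dists_input(core_genome: Optional[bool] = None,
                    use_gubbins: bool = True) -> str:
    """Return a label for the alignment that would be passed to snp-dists."""
    # An unset optional (None) is falsy, so it takes the same path as False.
    if core_genome:
        return "snp_sites_core_alignment"
    if use_gubbins:
        return "gubbins_polymorphic_fasta"
    # Placeholder: this case is not covered in the thread.
    return "unspecified"

v1_default = snp_dists_input()                 # core_genome left unset
explicit = snp_dists_input(core_genome=True)   # core_genome set by the user
```

With core_genome left unset, the sketch falls through to the Gubbins polymorphic FASTA, which matches what was observed in v1.0.1.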

But to get at the deeper issue - we will be updating the workflow to have core_genome set to true by default and thus the snp_sites core genome alignment will be generated.

We're also planning to update the iqtree2 task so that modelfinder is automatically run (unless the user defines their own model). We previously had some logic to select a model based on the core_genome input, but now it will be up to the user to define their own or allow modelfinder to do its thing.

- [x] update Notion docs (PHB v1.0.1 and PHB main) to reflect what is actually happening in the workflow. Mainly that core_genome is set to true
- [x] iqtree2 task changes in tasks/phylogenetic_inference/task_iqtree2.wdl
  - [x] remove conditional section on core_genome and selecting model based on that boolean. Update task to simply use user-defined model or allow modelfinder to run.
  - [x] remove core_genome as input to task
- [x] set core_genome = true by default in snippy_tree workflow

kapsakcj commented 1 year ago

@sam-baird Thanks for raising this issue. We have merged the PR, so you can now use the updated Snippy_Streamline workflow by using the main branch in Terra.

The core_genome input is now set to true by default and there is no default model for iqtree2. If you would like to use a specific model, you will need to provide it through the optional iqtree2_model input parameter. Otherwise iqtree2 will run its modelfinder and automatically choose a model for you.
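For anyone curious how this kind of fallback can look, here is a small Python sketch. It is not the task's actual command, which is not shown in this thread. The function name and the alignment file name are hypothetical. It relies only on IQ-TREE's `-s` (alignment) and `-m` (model) options, with `MFP` requesting ModelFinder followed by tree inference.

```python
from typing import List, Optional

def iqtree2_args(alignment: str, model: Optional[str] = None) -> List[str]:
    """Build an iqtree2 argument list, falling back to ModelFinder."""
    # Use the user's model when given; otherwise let ModelFinder choose.
    chosen = model if model else "MFP"
    return ["iqtree2", "-s", alignment, "-m", chosen]

auto = iqtree2_args("core.aln")            # hypothetical file name
manual = iqtree2_args("core.aln", "GTR+G")
```

Keeping the model independent of core_genome means the same rule applies whichever alignment is passed in.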

These changes will be incorporated into the next version release, but we don't have a timeline for that just yet. It may be another few weeks before we release a new version.

Let us know if you have any questions!

Snippy_Streamline: snp-dists does not seem to use core genome by default #143