theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

Shared variants tasks and QC improvements for kSNP3 and Snippy #291

Closed michellescribner closed 4 months ago

michellescribner commented 7 months ago

Closes https://github.com/theiagen/public_health_bioinformatics/issues/258

:hammer_and_wrench: Changes Being Made

This PR makes a number of changes to kSNP3 and Snippy-related workflows with the broad goal of creating summary files that show the SNPs shared among a set of samples and adding relevant QC assessments.

Adds a cat_files task which will concatenate variant files

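| samplename | CHROM | POS | TYPE | REF | ALT | EVIDENCE | FTYPE | STRAND | NT_POS | AA_POS | EFFECT | LOCUS_TAG | GENE | PRODUCT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sample1 | PEKT02000007 | 5224 | snp | C | G | G:21 C:0 | | | | | | | | |
| sample2 | PEKT02000007 | 34112 | snp | C | G | G:32 C:0 | CDS | + | 153/1620 | 51/539 | missense_variant c.153C>G p.His51Gln | B9J08_002604 | | hypothetical protein |
| sample3 | PEKT02000007 | 34487 | snp | T | A | A:41 T:0 | CDS | + | 528/1620 | 176/539 | missense_variant c.528T>A p.Asn176Lys | B9J08_002604 | | hypothetical protein |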
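
For reviewers who want a feel for what the concatenation does, here is a minimal Python sketch. It is not the task's actual code: the function name `concatenate_variants` and the assumption that each per-sample variant file is a comma-separated table with one header row are mine, for illustration only.

```python
import csv
import io


def concatenate_variants(sample_to_csv_text):
    """Concatenate per-sample variant tables into one CSV string.

    sample_to_csv_text maps a sample name to the text of that sample's
    variant table (assumed: comma-separated, first row is the header).
    Each data row is prefixed with the sample name it came from.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header_written = False
    for samplename, text in sample_to_csv_text.items():
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # skip empty files
        header, body = rows[0], rows[1:]
        if not header_written:
            # Write the header once, with the extra samplename column in front
            writer.writerow(["samplename"] + header)
            header_written = True
        for row in body:
            writer.writerow([samplename] + row)
    return out.getvalue()
```

A sample with a header but no variant rows contributes nothing to the output, so it will not appear in the concatenated table at all.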

Adds a new task tasks/phylogenetic_inference/utilities/task_shared_variants.wdl

| CHROM | POS | TYPE | REF | ALT | FTYPE | STRAND | NT_POS | AA_POS | EFFECT | LOCUS_TAG | GENE | PRODUCT | sample1 | sample2 | sample3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PEKT02000007 | 2693938 | snp | T | C | CDS | - | 1008/3000 | 336/999 | synonymous_variant c.1008A>G p.Lys336Lys | B9J08_003879 | NA | chitin synthase 1 | 1 | 1 | 0 |
| PEKT02000007 | 2529234 | snp | G | C | CDS | + | 282/336 | 94/111 | missense_variant c.282G>C p.Lys94Asn | B9J08_003804 | NA | cytochrome c | 1 | 1 | 1 |
| PEKT02000002 | 1043926 | snp | A | G | CDS | - | 542/1464 | 181/487 | missense_variant c.542T>C p.Ile181Thr | B9J08_000976 | NA | dihydrolipoyl dehydrogenase | 1 | 1 | 0 |
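
The presence/absence layout above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the contents of task_shared_variants.wdl: the function name `shared_variants`, the choice of `CHROM`, `POS`, `TYPE`, `REF`, and `ALT` as the key that identifies a variant, and the input shape (one dict per row of the concatenated table) are all hypothetical.

```python
def shared_variants(rows, key_fields=("CHROM", "POS", "TYPE", "REF", "ALT")):
    """Pivot long-format variant rows into a presence/absence table.

    rows is an iterable of dicts, each with a "samplename" entry plus the
    fields named in key_fields. Returns one dict per distinct variant,
    with a 1/0 column for every sample seen in the input.
    """
    samples = []    # sample names in first-seen order
    presence = {}   # variant key -> set of samples carrying that variant
    for row in rows:
        sample = row["samplename"]
        if sample not in samples:
            samples.append(sample)
        key = tuple(row[field] for field in key_fields)
        presence.setdefault(key, set()).add(sample)

    table = []
    for key, carriers in presence.items():
        record = dict(zip(key_fields, key))
        for sample in samples:
            record[sample] = 1 if sample in carriers else 0
        table.append(record)
    return table
```

Variants come out in the order they were first seen. Annotation columns such as `EFFECT` or `PRODUCT` are left out of the sketch to keep it short; they could be carried along by storing the first row seen for each key.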

Adds Find_Shared_Variants_PHB standalone workflow

Adds cat_variants task and shared_variants task to workflows/phylogenetics/wf_snippy_tree.wdl as optional modules

Adds snippy variants task QC improvements

Adds kSNP3 task QC improvements

Adds kSNP3 Shared SNP Task

Impacted Workflows/Tasks

The following tasks and workflows are modified based on the changes described above. All TheiaProk workflows are also impacted due to the changes to Merlin Magic.

:brain: Context and Rationale

Broadly, all of these changes were made to improve the ability of the user to assess the quality of phylogenetic analyses using the kSNP3 and Snippy workflows.

:clipboard: Workflow/Task Steps

Inputs

Outputs

New outputs for workflows that invoke snippy tree (snippy tree, snippy streamline):

- snippy_concatenated_variants
- snippy_shared_variants

New outputs for workflows that invoke the snippy variants task (snippy_variants standalone, theiaeuk, merlin magic):

- snippy_variants_num_reads_aligned
- snippy_variants_num_variants
- snippy_variants_coverage_tsv
- snippy_variants_percent_ref_coverage
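
This description does not spell out how snippy_variants_percent_ref_coverage is calculated. As a rough illustration of the kind of metric involved, the Python sketch below assumes a three-column, tab-separated per-position depth table (contig, position, depth) and a minimum depth threshold. The function name `percent_ref_coverage`, the input layout, and the default threshold of 10 are all assumptions, not the task's real behaviour.

```python
def percent_ref_coverage(depth_lines, min_depth=10):
    """Return the percentage of reference positions with depth >= min_depth.

    depth_lines is an iterable of tab-separated strings in the assumed
    layout "contig<TAB>position<TAB>depth", one line per reference position.
    """
    total = 0
    covered = 0
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        _contig, _position, depth = line.split("\t")
        total += 1
        if int(depth) >= min_depth:
            covered += 1
    # Avoid dividing by zero when the table is empty
    return 100.0 * covered / total if total else 0.0
```

A low value here would flag a sample whose reads cover little of the reference, which is the sort of sample a user may want to exclude before building a tree.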

New outputs for kSNP3:

- ksnp3_number_snps
- ksnp3_number_core_snps
- ksnp3_core_snp_table
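
How the two kSNP3 counts are derived is not stated here. One plausible way to get them, sketched below in Python, assumes the input is a FASTA-format SNP matrix with one column per SNP locus and `-` marking a locus missing from a genome; a core SNP is then a column with no missing entries. The function names and the input layout are assumptions made for illustration, not the task's implementation.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of record name -> sequence string."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {record: "".join(chunks) for record, chunks in parts.items()}


def count_snps(matrix_fasta_text, missing="-"):
    """Return (number_snps, number_core_snps) for a SNP matrix alignment.

    number_snps is the alignment length; number_core_snps counts the
    columns in which no genome carries the missing-data character.
    """
    sequences = list(parse_fasta(matrix_fasta_text).values())
    if not sequences:
        return 0, 0
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences in the SNP matrix differ in length")
    core = sum(1 for column in zip(*sequences) if missing not in column)
    return length, core
```

Comparing the two numbers gives a quick sense of how much of the SNP signal is shared by every genome in the set.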

Impacted Outputs

None

:test_tube: Testing

Locally

Terra

Scenarios for Reviewer to Test

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

emmadoughty commented 5 months ago

This is a beast! Initiating function testing for the following scenarios and datasets:

Do not intend to test:

@michellescribner, does this cover all the scenarios to test? Is the Find_Shared_Variants_PHB workflow intended to be run independently by users?

emmadoughty commented 5 months ago

@michellescribner This PR has a lot to test, so I am providing some initial feedback for discussion before testing all scenarios:

michellescribner commented 5 months ago

@emmadoughty Thank you so much for all of your comments!! Answers to come below...

When setting up the workflow, setting the optional input include_gbff to true doesn't seem to lead to a gbff (gbk) file being used as the reference, nor being output to the Terra table. I'm not sure what this input is doing, but if we could use the gbk file as the input reference genome (this is feasible for snippy), we would get the gene annotations in the snippy_concatenated_snps and snippy_shared_snps output files, which I think would be very helpful for users (rather than simply genome positions).

To my limited knowledge, setting include_gbff to true causes the ncbi_datasets task to download both the fasta file and the gbk file within Snippy Streamline, but the fasta is still the file passed on to the snippy variants task. I already provide gbk files as the input reference genome when I directly provide a reference genome to the workflow, and can confirm that you wind up receiving the gene annotations in the concatenated snps file, so this is definitely already possible! I like your idea to make it possible for Snippy Streamline to use the gbk, but I'm not sure if we want to do that in this branch or create a new one.

michellescribner commented 5 months ago

Is there a reason that call_shared_variants = false? This seems like a very useful output, so it would be good to have it by default.

The file name for the snippy_concatenated_variants output is just the tree_name input. Could we make this file name more descriptive and give it a .csv file extension so that it opens up in Excel properly?

michellescribner commented 5 months ago

Note, the outputs on Terra are snippy_concatenated_variants and snippy_shared_variants_table, using the word "variants" rather than "snps" as mentioned in the PR.

michellescribner commented 5 months ago

The output for snippy_variants_num_reads_aligned and other outputs coming from Snippy_Variants are simply listed as below. Whilst helpful for checking if there are any samples with insufficient quality to include in the tree, it doesn't help to identify which samples to exclude. Could each result perhaps be appended with the sample name?

image

michellescribner commented 5 months ago

Similarly, snippy_variants_coverage_tsv outputs don't seem to be particularly helpful as a Snippy_Streamline output as the set-level output is just giving a list of GS URIs that aren't easy to access. Would it be worth removing this output for Snippy_Streamline?

emmadoughty commented 5 months ago

@michellescribner I have updated my comment above to reflect the scenarios to test. I hadn't realized that the QC outputs were not intended to be in Snippy_Streamline. That said, I think they would be incredibly useful in this workflow, so I'm going to open a feature request issue to integrate this. I'm also going to open an enhancement issue to use the gbk file from ReferenceSeeker.

michellescribner commented 4 months ago

Relaunched tests post modifications above

Tests launched on 23 C. auris specimens:

- TheiaEuk_Illumina_PE_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/93b97e30-18d6-407d-8c44-d33f45be4ad9
- Snippy_Variants_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/e468f95d-dd3b-486c-992e-b82c8b4c2468
- Snippy_Tree_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/7f6daae1-85bd-425a-be4e-cb2efb548dde
- kSNP3_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/f352cf52-79ff-44e3-ac86-8af9f3ec2824
  - Verified that 3 new output columns are present and core tree remains identical to PHB v1.3
- Snippy_Streamline_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/b366d926-27f4-4935-94e4-cbacb0156ab7

Other function tests to make sure nothing broke:

- TheiaProk_Illumina_SE_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/4fd029dc-d1a0-4333-8aa4-c06fe52991fc
- TheiaProk_FASTA_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/b42dd53d-e750-40ee-a184-e387925c6956
- TheiaProk_ONT_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/4b30117b-c952-4545-871e-c101e44030cb
- Concatenate_Column_Content_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/43082c2a-b4a4-47a6-8fca-fdc089427b25
- TheiaCoV_Illumina_PE_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/3e619181-cd6f-41c0-88fe-c66842c7280d
- Snippy_Streamline_PHB on some bacterial samples: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/5f4f0b53-5f24-4cdc-afb4-640dddb297f3

michellescribner commented 4 months ago

Retesting after silly mistake where I didn't update variable name:

- TheiaEuk_Illumina_PE_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/9ace08ba-035a-4d32-bb1b-ce4421e32192
- Snippy_Variants_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/9aa4f317-0943-4609-aa46-d181ba8c8988

sage-wright commented 4 months ago

Code changes approved.