Closed michellescribner closed 4 months ago
This is a beast! Initiating function testing for the following scenarios and datasets:
Do not intend to test:
@michellescribner, does this cover all the scenarios to test? Is the Find_Shared_Variants_PHB workflow intended to be run independently by users?
@michellescribner This PR has a lot to test, so I am providing some initial feedback for discussion before testing all scenarios:
include_gbff
being set to true doesn't seem to lead to a gbff (gbk) file being used as the reference, nor being output to the Terra table. I'm not sure what this input is doing, but if we could use the gbk file as the input reference genome (this is feasible for snippy), we would get the gene annotations in the - snippy_concatenated_snps
and snippy_shared_snps
output files, which I think would be very helpful for users (rather than simply genome positions).call_shared_variants
= false? This seem like a very useful output so it would be good to have it by defaultsnippy_concatenated_variants
output is just the tree_name
input. Could we make this file name more descriptive and give it a .csv file extension so that it opens up in Excel properly?snippy_concatenated_variants
and snippy_shared_variants_table
, using the word "variants" rather than "snps" as mentioned in the PR.snippy_variants_num_reads_aligned
and other outputs coming from Snippy_Variants are simply listed as below. Whilst helpful for checking if there are any samples with insufficient quality to include in the tree, it doesn't help to identify which samples to exclude. Could each result perhaps be appended with the sample name?
snippy_variants_coverage_tsv
outputs don't seem to be particularly helpful as a Snippy_Streamline output as the set-level output is just giving a list of GS URIs that aren't easy to access. Would it be worth removing this output for Snippy_Streamline?@emmadoughty Thank you so much for all of your comments!! Answers to come below...
When setting up the workflow, an optional input, include_gbff being set to true doesn't seem to lead to a gbff (gbk) file being used as the reference, nor being output to the Terra table. I'm not sure what this input is doing, but if we could use the gbk file as the input reference genome (this is feasible for snippy), we would get the gene annotations in the - snippy_concatenated_snps and snippy_shared_snps output files, which I think would be very helpful for users (rather than simply genome positions).
-To my limited knowledge, include_gbff being set to true causes the ncbi_datasets task to download both the fasta file and the gbk file within Snippy Streamline, but the fasta is still the file passed on to the snippy variants task. I already provide gbk files as the input reference genome for when I directly provide a reference genome to the workflow, and can confirm that you wind up receiving the gene annotations in the concatenated snps file, so this is definitely already possible! I like your idea to make it possible for Snippy Streamline to use the gbk but not sure if we want to do that in this branch or create a new one.
Is there a reason that call_shared_variants = false? This seem like a very useful output so it would be good to have it by default
The file name for the snippy_concatenated_variants output is just the tree_name input. Could we make this file name more descriptive and give it a .csv file extension so that it opens up in Excel properly?
cat_files
task to concatenate the variants from multiple samples into one file, and this is the existing name of the file that is outputted by the task. I couldn't figure out how to add a ".csv" extension or other name to the file without altering the name of all files that would be concatenated with this task (some of which are not csvs). There might be a very easy solution to this, I will ask other folks!Note, the outputs on Terra are snippy_concatenated_variants and snippy_shared_variants_table, using the word "variants" rather than "snps" as mentioned in the PR.
The output for snippy_variants_num_reads_aligned and other outputs coming from Snippy_Variants are simply listed as below. Whilst helpful for checking if there are any samples with insufficient quality to include in the tree, it doesn't help to identify which samples to exclude. Could each result perhaps be appended with the sample name?
Similarly, snippy_variants_coverage_tsv outputs don't seem to be particularly helpful as a Snippy_Streamline output as the set-level output is just giving a list of GS URIs that aren't easy to access. Would it be worth removing this output for Snippy_Streamline?
@michellescribner I have updated my comment above to reflect the scenarios to test. I hadn't realized that the QC outputs were not intended to be in Snippy_Streamline. That said, I think they would be incredibly useful in this workflow so I'm going to open a feature request issue to integrate this. I'm also going to open an enhancement issue to use the gbk file from reference seeker
Relaunched tests post modifications above
Tests launched on 23 C. auris specimens: TheiaEuk_Illumina_PE_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/93b97e30-18d6-407d-8c44-d33f45be4ad9 Snippy_Variants_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/e468f95d-dd3b-486c-992e-b82c8b4c2468 Snippy_Tree_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/7f6daae1-85bd-425a-be4e-cb2efb548dde kSNP3_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/f352cf52-79ff-44e3-ac86-8af9f3ec2824 -Verified that 3 new output columns are present and core tree remains identical to PHB v1.3 Snippy_Streamline_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/b366d926-27f4-4935-94e4-cbacb0156ab7
Other function tests to make sure nothing broke: TheiaProk_Illumina_SE_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/4fd029dc-d1a0-4333-8aa4-c06fe52991fc TheiaProk_FASTA_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/b42dd53d-e750-40ee-a184-e387925c6956 TheiaProk_ONT_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/4b30117b-c952-4545-871e-c101e44030cb Concatenate_Column_Content_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/43082c2a-b4a4-47a6-8fca-fdc089427b25 TheiaCoV_Illumina_PE_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/3e619181-cd6f-41c0-88fe-c66842c7280d Snippy_Streamline_PHB on some bacterial samples: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/5f4f0b53-5f24-4cdc-afb4-640dddb297f3
Retesting after silly mistake where I didn't update variable name:
TheiaEuk_Illumina_PE_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/9ace08ba-035a-4d32-bb1b-ce4421e32192 Snippy_Variants_PHB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/9aa4f317-0943-4609-aa46-d181ba8c8988
Closes https://github.com/theiagen/public_health_bioinformatics/issues/258
:hammer_and_wrench: Changes Being Made
This PR makes a number of changes to kSNP3 and Snippy-related workflows with the broad goal of creating summary files that show the SNPs shared among a set of samples and adding relevant QC assessments.
Adds a
cat_files
task which will concatenate variant filescat_files
file/tasks/utilities/file_handling/task_cat_files.wdl
, adds a taskcat_variants
, which is similar to thecat_files
task except adds the sample names as an additional column in the concatenated fileAdds a new task
tasks/phylogenetic_inference/utilities/task_shared_variants.wdl
shared_variants_table
.Adds Find_Shared_Variants_PHB standalone workflow
cat_variants
task above. It then reshapes this output using theshared_variants
task above.Adds
cat_variants
task andshared_variants
task to workflows/phylogenetics/wf_snippy_tree.wdl as an optional modulescall_shared_variants
= truetasks/phylogenetic_inference/task_snippy_core.wdl
to glob the snippy results csvs after untar-ing snippy results tarballsAdds snippy variants task QC improvements
samtools view -c
samtools coverage
samtools depth
Adds kSNP3 task QC improvements
ksnp3_number_snps
ksnp3_number_core_snps
Adds kSNP3 Shared SNP Task
tasks/phylogenetic_inference/utilities/task_ksnp3_shared_snps.wdl
which takes inksnp3_vcf_ref_genome file
as input, filters for only core SNPs, sorts SNPs by number of occurrences in the sample set, and outputs the table as a csv fileksnp3_core_snp_table
Impacted Workflows/Tasks
The following tasks and workflows are modified based on the changes described above. All TheiaProk workflows are also impacted due to the changes to Merlin Magic.
Modifies file
tasks/utilities/file_handling/task_cat_files.wdl
to add new taskcat_variants
Adds new task
tasks/phylogenetic_inference/utilities/task_shared_variants.wdl
Adds new workflow
workflows/utilities/file_handling/wf_find_shared_variants.wdl
Adds Find_Shared_Variants_PHB workflow to
dockstore.yml
Adds shared variants task to Snippy Tree wf
workflows/phylogenetics/wf_snippy_tree.wdl
Globs results csvs from snippy core task for downstream use by concatenate variants
tasks/phylogenetic_inference/task_snippy_core.wdl
Adds new qc columns to
tasks/gene_typing/task_snippy_variants.wdl
Adds snippy variants qc columns to
workflows/standalone_modules/wf_snippy_variants.wdl
Adds snippy variants qc columns to
workflows/utilities/wf_merlin_magic.wdl
Adds snippy variants qc columns to
workflows/theiaeuk/wf_theiaeuk_illumina_pe.wdl
Adds new qc columns to
tasks/phylogenetic_inference/task_ksnp3.wdl
Adds new task for creating shared SNP table
tasks/phylogenetic_inference/utilities/task_ksnp3_shared_snps.wdl
Adds shared SNP task to kSNP3 workflow and new qc columns
workflows/phylogenetics/wf_ksnp3.wdl
:brain: Context and Rationale
Broadly, all of these changes were made to improve the ability of the user to assess the quality of phylogenetic analyses using the kSNP3 and Snippy workflows.
workflows/utilities/file_handling/wf_concatenate_column.wdl
workflow does not include sample information in the final csv. This workflow will therefore enable users to concatenate snippy results across multiple samples effectively.:clipboard: Workflow/Task Steps
Inputs
Outputs
New outputs for workflows that invoke snippy tree: (snippy tree, snippy streamline)
New outputs for workflows that invoke the snippy variants task: (snippy_variants standalone, theiaeuk, merlin magic)
New outputs for kSNP3:
Impacted Outputs
None
:test_tube: Testing
Locally
Terra
Scenarios for Reviewer to Test
:microscope: Final Developer Checklist
🎯 Reviewer Checklist