🗑️ This dev branch should be deleted after merging to main.
:brain: Aim, Context and Functionality
This PR adds emmtyper to the TheiaProk workflows to perform emm gene typing on Streptococcus pyogenes assemblies. TheiaProk_Illumina_PE_PHB already contains a similar task using the read-based emmtyping_tool, which is only compatible with paired-end reads. Since emmtyper works on assemblies, it is compatible with the paired-end, single-end, FASTA, and ONT versions of the TheiaProk workflows. Also, there are sometimes differences between the two tools' outputs (see validation test results below), so I think it is useful to make both tools available.
:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made
This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes, there are now a few additional outputs that will be filled if the species is identified as Streptococcus pyogenes, but the change will not affect existing results.
Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No
:clipboard: Workflow/Task Step Changes
task_emmtyper.wdl already existed in this repo, but was not being used in any workflows. I made some changes to the existing task and updated the TheiaProk workflows to use the task.
🔄 Data Processing
Docker/software or software versions changed: None
Databases or database versions changed: None
Data processing/commands changed: The task now parses and outputs the "final" emm type from the output TSV. The task now uses --output-format verbose TSV format instead of the default short format.
File processing changed: None
Compute resources changed: None
➡️ Inputs
None.
#### ⬅️ Outputs
The TheiaProk workflows have these new outputs: emmtyper_emm_type, emmtyper_results_tsv, emmtyper_version, and emmtyper_docker.
:test_tube: Testing
Test Dataset
The workflows were tested using 32-sample validation set of isolates with known emm types provided by the CDC Strep Lab. The reads were fetched from NCBI using the SRA_Fetch_PHB workflow.
Commandline Testing with MiniWDL or Cromwell (optional)
None.
Terra Testing
I performed Terra testing for the TheiaProk_Illumina_PE_PHB, TheiaProk_Illumina_SE_PHB, and TheiaProk_FASTA_PHB workflows in separate data tables with call caching turned off. For testing TheiaProk_Illumina_SE_PHB, only read1 was used. For testing TheiaProk_FASTA_PHB, the assembly_fasta outputs from TheiaProk_Illumina_PE were used.
Comparison table of emm type results
BioSample
Expected emm type
TheiaProk PE emmtyper
TheiaProk PE emmtypingtool
TheiaProk SE emmtyper
TheiaProk FASTA emmtyper
SAMN26504544
83.1
EMM83.1
emm83.1.sds
EMM83.1
EMM83.1
SAMN26504545
76.0
EMM76.0
emm76.0.sds
EMM76.0
EMM76.0
SAMN26504546
4.0
EMM4.0
emm4.0.sds
EMM4.0
EMM4.0
SAMN26504547
73.7
EMM73.7
emm73 ⚠️
EMM73.7
EMM73.7
SAMN26504548
12.0
EMM12.0
emm12.0.sds
EMM12.0
EMM12.0
SAMN26504549
4.0
EMM4.0
emm4.0.sds
EMM4.0
EMM4.0
SAMN26504550
183.1
EMM183.1
emm183.1.sds/emm149.0.sds ⚠️
EMM183.1
EMM183.1
SAMN26504551
77.0
EMM77.0
emm77.0.sds
EMM77.0
EMM77.0
SAMN26504568
28.0
EMM28.0
emm28.0.sds
EMM28.0
EMM28.0
SAMN26504569
2.0
EMM2.0
emm2.0.sds
EMM2.0
EMM2.0
SAMN26504552
12.0
EMM12.0
emm12.0.sds
EMM12.0
EMM12.0
SAMN26504553
89.0
EMM89.0
emm89.0.sds
EMM89.0
EMM89.0
SAMN26504554
89.0
EMM89.0
emm89.0.sds
EMM89.0
EMM89.0
SAMN26504570
2.0
EMM2.0
emm2.0.sds
EMM2.0
EMM2.0
SAMN26504571
2.0
EMM2.0
emm2.0.sds
EMM2.0
EMM2.0
SAMN26504555
75.0
EMM75.0
emm75.0.sds
EMM75.0
EMM75.0
SAMN26504556
77.0
EMM77.0
emm77.0.sds
EMM77.0
EMM77.0
SAMN26504557
77.0
EMM77.0
emm77.0.sds
EMM77.0
EMM77.0
SAMN26504558
92.0
EMM92.0
emm92.0.sds
EMM92.0
EMM92.0
SAMN26504559
92.0
EMM92.0
emm92.0.sds
EMM92.0
EMM92.0
SAMN26504560
90.2
EMM90.2
emm90.2.sds
EMM90.2
EMM90.2
SAMN26504562
59.0
EMM59.0
emm59.0.sds
EMM59.0
EMM59.0
SAMN26504563
83.1
EMM83.1
emm83.1.sds
EMM83.1
EMM83.1
SAMN26504564
77.0
EMM77.0
emm77.0.sds
EMM77.0
EMM77.0
SAMN26504572
11.0
EMM11.0
emm11.0.sds
EMM11.0
EMM11.0
SAMN26504573
11.0
EMM11.0
emm11.0.sds
EMM11.0
EMM11.0
SAMN26504565
28.0
EMM28.0
emm28.0.sds
EMM28.0
EMM28.0
SAMN26504566
92.0
EMM92.0
emm92.0.sds
EMM92.0
EMM92.0
SAMN26504567
73.0
EMM73.0
emm73.0.sds
EMM73.0
EMM73.0
SAMN26504574
1.0
EMM1.0
emm1.0.sds
EMM1.0
EMM1.0
SAMN26504575
6.4
EMM6.4
emm6.4.sds
EMM6.4
EMM6.4
SAMN26504576
89.0
EMM89.0
emm89.0.sds
EMM89.0
EMM89.0
⚠️: SAMN26504547 and SAMN26504550 showed different results between emmtypingtool and emmtyper. Specifically, emmtypingtool did not report the subtype for SAMN26504547, reporting "73" instead of "73.7". emmtypingtool reported two subtypes "183.1/149.0" instead of "183.1" for SAMN26504550. The SAMN26504547 emmtypingtool BAM alignment showed a 3-bp deletion when aligned to emm73.7. This looks to be an artifact caused by emmtypingtool adding 100bp flanking sequences from a reference genome to the 180bp trimmed fragment. The added flanking sequences appear to have caused a change in alignment for a majority of the reads resulting in the 3 bp deletion at the 3' end of the 180 bp region. For SAMN26504550, the extra emm type 149.0 is a known emm-like gene (according to the source code for emmtyper), so this looks like 149.0 should have been excluded.
At least for this dataset, emmtyper seemed to be more accurate since it matched all of the expected emm types.
Suggested Scenarios for Reviewer to Test
I did not test TheiaProk_ONT_PHB workflow.
Theiagen Version Release Testing (optional)
:microscope: Final Developer Checklist
[ ] The workflow/task has been tested locally and results, including file contents, are as anticipated
[x] The workflow/task has been tested on Terra and results, including file contents, are as anticipated
[x] The CI/CD has been adjusted and tests are passing (to be completed by Theiagen developer)
Thanks for this really well written PR! I'm currently running some additional tests on Terra (including TheiaProk_ONT!), and I'm planning on merging this PR upon their successful completion/results!
This PR closes #413.
🗑️ This dev branch should be deleted after merging to main.
:brain: Aim, Context and Functionality
This PR adds emmtyper to the TheiaProk workflows to perform emm gene typing on Streptococcus pyogenes assemblies. TheiaProk_Illumina_PE_PHB already contains a similar task using the read-based emmtyping_tool, which is only compatible with paired-end reads. Since emmtyper works on assemblies, it is compatible with the paired-end, single-end, FASTA, and ONT versions of the TheiaProk workflows. Also, there are sometimes differences between the two tools' outputs (see validation test results below), so I think it is useful to make both tools available.
:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made
This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes, there are now a few additional outputs that will be filled if the species is identified as Streptococcus pyogenes, but the change will not affect existing results.
Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No
:clipboard: Workflow/Task Step Changes
task_emmtyper.wdl
already existed in this repo, but was not being used in any workflows. I made some changes to the existing task and updated the TheiaProk workflows to use the task.🔄 Data Processing
Docker/software or software versions changed: None
Databases or database versions changed: None
Data processing/commands changed: The task now parses and outputs the "final" emm type from the output TSV. The task now uses
--output-format verbose
TSV format instead of the default short format.File processing changed: None
Compute resources changed: None
➡️ Inputs
None. #### ⬅️ OutputsThe TheiaProk workflows have these new outputs:
emmtyper_emm_type
,emmtyper_results_tsv
,emmtyper_version
, andemmtyper_docker
.:test_tube: Testing
Test Dataset
The workflows were tested using 32-sample validation set of isolates with known emm types provided by the CDC Strep Lab. The reads were fetched from NCBI using the SRA_Fetch_PHB workflow.
Commandline Testing with MiniWDL or Cromwell (optional)
None.
Terra Testing
I performed Terra testing for the TheiaProk_Illumina_PE_PHB, TheiaProk_Illumina_SE_PHB, and TheiaProk_FASTA_PHB workflows in separate data tables with call caching turned off. For testing TheiaProk_Illumina_SE_PHB, only
read1
was used. For testing TheiaProk_FASTA_PHB, theassembly_fasta
outputs from TheiaProk_Illumina_PE were used.⚠️: SAMN26504547 and SAMN26504550 showed different results between emmtypingtool and emmtyper. Specifically, emmtypingtool did not report the subtype for SAMN26504547, reporting "73" instead of "73.7". emmtypingtool reported two subtypes "183.1/149.0" instead of "183.1" for SAMN26504550. The SAMN26504547 emmtypingtool BAM alignment showed a 3-bp deletion when aligned to emm73.7. This looks to be an artifact caused by emmtypingtool adding 100bp flanking sequences from a reference genome to the 180bp trimmed fragment. The added flanking sequences appear to have caused a change in alignment for a majority of the reads resulting in the 3 bp deletion at the 3' end of the 180 bp region. For SAMN26504550, the extra emm type 149.0 is a known emm-like gene (according to the source code for emmtyper), so this looks like 149.0 should have been excluded.
At least for this dataset, emmtyper seemed to be more accurate since it matched all of the expected emm types.
Suggested Scenarios for Reviewer to Test
I did not test TheiaProk_ONT_PHB workflow.
Theiagen Version Release Testing (optional)
:microscope: Final Developer Checklist
🎯 Reviewer Checklist
🗂️ Associated Documentation (to be completed by Theiagen developer)