theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0
33 stars 15 forks source link

[TheiaProk] Add emmtyper task for Streptococcus pyogenes #524

Open sam-baird opened 3 days ago

sam-baird commented 3 days ago

This PR closes #413.

🗑️ This dev branch should be deleted after merging to main.

:brain: Aim, Context and Functionality

This PR adds emmtyper to the TheiaProk workflows to perform emm gene typing on Streptococcus pyogenes assemblies. TheiaProk_Illumina_PE_PHB already contains a similar task using the read-based emmtyping_tool, which is only compatible with paired-end reads. Since emmtyper works on assemblies, it is compatible with the paired-end, single-end, FASTA, and ONT versions of the TheiaProk workflows. Also, there are sometimes differences between the two tools' outputs (see validation test results below), so I think it is useful to make both tools available.

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes, there are now a few additional outputs that will be filled if the species is identified as Streptococcus pyogenes, but the change will not affect existing results.

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

:clipboard: Workflow/Task Step Changes

task_emmtyper.wdl already existed in this repo, but was not being used in any workflows. I made some changes to the existing task and updated the TheiaProk workflows to use the task.

🔄 Data Processing

Docker/software or software versions changed: None

Databases or database versions changed: None

Data processing/commands changed: The task now parses and outputs the "final" emm type from the output TSV. The task now uses --output-format verbose TSV format instead of the default short format.

File processing changed: None

Compute resources changed: None

➡️ Inputs

None. #### ⬅️ Outputs

The TheiaProk workflows have these new outputs: emmtyper_emm_type, emmtyper_results_tsv, emmtyper_version, and emmtyper_docker.

:test_tube: Testing

Test Dataset

The workflows were tested using 32-sample validation set of isolates with known emm types provided by the CDC Strep Lab. The reads were fetched from NCBI using the SRA_Fetch_PHB workflow.

Commandline Testing with MiniWDL or Cromwell (optional)

None.

Terra Testing

I performed Terra testing for the TheiaProk_Illumina_PE_PHB, TheiaProk_Illumina_SE_PHB, and TheiaProk_FASTA_PHB workflows in separate data tables with call caching turned off. For testing TheiaProk_Illumina_SE_PHB, only read1 was used. For testing TheiaProk_FASTA_PHB, the assembly_fasta outputs from TheiaProk_Illumina_PE were used.

image

Comparison table of emm type results BioSample Expected emm type TheiaProk PE emmtyper TheiaProk PE emmtypingtool TheiaProk SE emmtyper TheiaProk FASTA emmtyper
SAMN26504544 83.1 EMM83.1 emm83.1.sds EMM83.1 EMM83.1
SAMN26504545 76.0 EMM76.0 emm76.0.sds EMM76.0 EMM76.0
SAMN26504546 4.0 EMM4.0 emm4.0.sds EMM4.0 EMM4.0
SAMN26504547 73.7 EMM73.7 emm73 ⚠️ EMM73.7 EMM73.7
SAMN26504548 12.0 EMM12.0 emm12.0.sds EMM12.0 EMM12.0
SAMN26504549 4.0 EMM4.0 emm4.0.sds EMM4.0 EMM4.0
SAMN26504550 183.1 EMM183.1 emm183.1.sds/emm149.0.sds ⚠️ EMM183.1 EMM183.1
SAMN26504551 77.0 EMM77.0 emm77.0.sds EMM77.0 EMM77.0
SAMN26504568 28.0 EMM28.0 emm28.0.sds EMM28.0 EMM28.0
SAMN26504569 2.0 EMM2.0 emm2.0.sds EMM2.0 EMM2.0
SAMN26504552 12.0 EMM12.0 emm12.0.sds EMM12.0 EMM12.0
SAMN26504553 89.0 EMM89.0 emm89.0.sds EMM89.0 EMM89.0
SAMN26504554 89.0 EMM89.0 emm89.0.sds EMM89.0 EMM89.0
SAMN26504570 2.0 EMM2.0 emm2.0.sds EMM2.0 EMM2.0
SAMN26504571 2.0 EMM2.0 emm2.0.sds EMM2.0 EMM2.0
SAMN26504555 75.0 EMM75.0 emm75.0.sds EMM75.0 EMM75.0
SAMN26504556 77.0 EMM77.0 emm77.0.sds EMM77.0 EMM77.0
SAMN26504557 77.0 EMM77.0 emm77.0.sds EMM77.0 EMM77.0
SAMN26504558 92.0 EMM92.0 emm92.0.sds EMM92.0 EMM92.0
SAMN26504559 92.0 EMM92.0 emm92.0.sds EMM92.0 EMM92.0
SAMN26504560 90.2 EMM90.2 emm90.2.sds EMM90.2 EMM90.2
SAMN26504562 59.0 EMM59.0 emm59.0.sds EMM59.0 EMM59.0
SAMN26504563 83.1 EMM83.1 emm83.1.sds EMM83.1 EMM83.1
SAMN26504564 77.0 EMM77.0 emm77.0.sds EMM77.0 EMM77.0
SAMN26504572 11.0 EMM11.0 emm11.0.sds EMM11.0 EMM11.0
SAMN26504573 11.0 EMM11.0 emm11.0.sds EMM11.0 EMM11.0
SAMN26504565 28.0 EMM28.0 emm28.0.sds EMM28.0 EMM28.0
SAMN26504566 92.0 EMM92.0 emm92.0.sds EMM92.0 EMM92.0
SAMN26504567 73.0 EMM73.0 emm73.0.sds EMM73.0 EMM73.0
SAMN26504574 1.0 EMM1.0 emm1.0.sds EMM1.0 EMM1.0
SAMN26504575 6.4 EMM6.4 emm6.4.sds EMM6.4 EMM6.4
SAMN26504576 89.0 EMM89.0 emm89.0.sds EMM89.0 EMM89.0

⚠️: SAMN26504547 and SAMN26504550 showed different results between emmtypingtool and emmtyper. Specifically, emmtypingtool did not report the subtype for SAMN26504547, reporting "73" instead of "73.7". emmtypingtool reported two subtypes "183.1/149.0" instead of "183.1" for SAMN26504550. The SAMN26504547 emmtypingtool BAM alignment showed a 3-bp deletion when aligned to emm73.7. This looks to be an artifact caused by emmtypingtool adding 100bp flanking sequences from a reference genome to the 180bp trimmed fragment. The added flanking sequences appear to have caused a change in alignment for a majority of the reads resulting in the 3 bp deletion at the 3' end of the 180 bp region. For SAMN26504550, the extra emm type 149.0 is a known emm-like gene (according to the source code for emmtyper), so this looks like 149.0 should have been excluded.

At least for this dataset, emmtyper seemed to be more accurate since it matched all of the expected emm types.

Suggested Scenarios for Reviewer to Test

I did not test TheiaProk_ONT_PHB workflow.

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

🗂️ Associated Documentation (to be completed by Theiagen developer)

sage-wright commented 1 hour ago

Hey Sam,

Thanks for this really well written PR! I'm currently running some additional tests on Terra (including TheiaProk_ONT!), and I'm planning on merging this PR upon their successful completion/results!