theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

[TheiaProk] Adds stxtyper to merlin_magic and TheiaProk wfs #525

Open kapsakcj opened 3 months ago

kapsakcj commented 3 months ago

Please keep this PR as draft for now as stxtyper is still undergoing validation/peer review/publication

Using this draft PR to track stxtyper development as well as AMRFinderPlus (which runs stxtyper under the hood).

Our partners are actively using this branch for testing purposes.

I'll try to keep this branch up-to-date with main to incorporate other changes & resolve merge conflicts if they arise.

This PR closes #443

🗑️ This dev branch should NOT be deleted after merging to main.

2024-09-23 update: I expect to hear feedback from our partner soon, but updating PR message now with tests and info

:brain: Aim, Context and Functionality

This PR adds Stxtyper to the TheiaProk workflows. Stxtyper is used to detect and type shiga toxin genes in bacterial genome assemblies. It also attempts to detect novel shiga toxin subtypes in cases where the detected sequences diverge from the reference sequences.

These genes are usually found in E. coli (STEC), but can also be found in Shigella species and, more rarely, in other genera such as Klebsiella. Stxtyper is developed by NCBI in collaboration with a number of groups, including CDC, FDA, SSI, and others. A publication fully describing the tool and its validation is in the works, but a software release has been made so the community can test the software further and begin using the tool.

This tool queries genome assemblies for 2 genes or subunits involved in shiga toxin production, stxA and stxB. The A subunit is longer than the B subunit. Stxtyper attempts to detect these, compare them to a database of known sequences, and type them based on amino acid composition. The typing algorithm will be described in the publication when it is published.

More info & source code found here: https://github.com/ncbi/stxtyper
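For reviewers who want to try the tool outside of the workflow, here is a minimal Python sketch of assembling a standalone stxtyper command line. The `--nucleotide`, `--output`, and `--name` options are taken from the stxtyper README as I understand it, so please confirm them with `stxtyper --help` for the version you have installed. The function name and the file names are hypothetical and are not part of this PR.

```python
import shlex


def build_stxtyper_command(assembly_fasta, output_tsv, sample_name=None):
    """Build an argument list for a standalone stxtyper run.

    Assumption: the flags below match the installed stxtyper release;
    check `stxtyper --help` before relying on them.
    """
    cmd = ["stxtyper", "--nucleotide", assembly_fasta, "--output", output_tsv]
    if sample_name:
        # Optional sample label written into the report
        cmd += ["--name", sample_name]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only
    command = build_stxtyper_command("sample1.fasta", "sample1_stxtyper.tsv", "sample1")
    print(shlex.join(command))
```

The list form can be passed straight to `subprocess.run(command, check=True)` on a machine where stxtyper and its BLAST dependency are installed.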

To learn more about shiga toxin subtypes and the description of the latest subtypes, Stx2n, Stx2j, Stx2m, and Stx2o, see this publication (shameless plug): https://www.mdpi.com/2076-2607/11/10/2561

Eventually this tool will be incorporated into AMRFinderPlus and will run behind-the-scenes when the user provides the amrfinder --organism Escherichia option, but we wanted the functionality now and the ability to run separate from AMRFinderPlus.

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes/No

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : Yes/No

:clipboard: Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed:

Databases or database versions changed:

Data processing/commands changed:

File processing changed:

Compute resources changed:

➡️ Inputs

⬅️ Outputs

:test_tube: Testing

Test Dataset

Commandline Testing with MiniWDL or Cromwell (optional)

Terra Testing

Suggested Scenarios for Reviewer to Test

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

🗂️ Associated Documentation (to be completed by Theiagen developer)

kapsakcj commented 3 months ago

other TODOs:

kapsakcj commented 2 months ago

Waiting on user feedback prior to making more code changes

Plan as of 2024-07-10:

This will allow organisms of any genus/species to be screened for stx genes, since they can occur in genera/species other than E. coli & Shigella

kapsakcj commented 2 months ago

Also - adjust the conditional in the merlin_magic code so that the user can "opt in" to running stxtyper, regardless of the taxon (i.e. gambit_predicted_taxon).

That way stxtyper is run automatically on all E. coli and Shigella, and the user has the ability to run it on other taxa.
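A rough sketch of the intended conditional, written in Python rather than WDL just to illustrate the logic. The names `gambit_predicted_taxon` and `call_stxtyper` come from this PR; the function name and the prefix matching on the taxon string are assumptions and may not match how merlin_magic actually compares taxa.

```python
def should_run_stxtyper(gambit_predicted_taxon: str, call_stxtyper: bool = False) -> bool:
    """Decide whether stxtyper should run for a sample.

    Assumption: the predicted taxon is a plain string such as
    "Escherichia coli" or "Shigella sonnei", matched here by prefix.
    """
    # The user opt-in wins regardless of the predicted taxon
    if call_stxtyper:
        return True
    taxon = gambit_predicted_taxon.strip().lower()
    # Otherwise run automatically only for E. coli and Shigella
    return taxon.startswith("escherichia coli") or taxon.startswith("shigella")


if __name__ == "__main__":
    for taxon, opt_in in [
        ("Escherichia coli", False),
        ("Shigella sonnei", False),
        ("Acinetobacter baumannii", False),
        ("Acinetobacter baumannii", True),
    ]:
        print(taxon, opt_in, should_run_stxtyper(taxon, opt_in))
```

Keeping the opt-in check first means the default behavior for E. coli and Shigella is unchanged when the Boolean is left unset.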

kapsakcj commented 1 week ago

Successfully ran stxtyper on 1 A. baumannii and 1 Burkholderia cepacia genome using the call_stxtyper optional input Boolean. https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-Aug2024/job_history/df49fd54-2614-4c0d-a21b-914ca40da962

Awaiting feedback from our PH partner, and will update remaining TheiaProk workflows & CI after making any further adjustments/changes