theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

Add abricate as optional module #431

Status: Closed (jrotieno closed this 4 months ago)

jrotieno commented 5 months ago

This PR closes #398.

🗑️ This dev branch should be deleted after merging to main.

:brain: Aim, Context and Functionality

Prior to this PR, the TheiaProk workflows called abricate only in two cases: with the "AcinetobacterPlasmidTyping" database when the gambit-predicted pathogen was "Acinetobacter baumannii", and with the "vibrio" database when it was "Vibrio" or "Vibrio cholerae".

This PR adds an optional abricate module for genomic characterization. Users can run abricate as part of their TheiaProk analysis by setting the optional input parameter call_abricate to true (default: false). The abricate database can be chosen through the optional input parameter abricate_db, which defaults to "vfdb".
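As an illustration of how the two optional inputs described above fit together, here is a minimal Python sketch that builds a WDL-style inputs JSON. The helper name build_abricate_inputs and the workflow name theiaprok_illumina_pe are assumptions made for this example; only call_abricate, abricate_db, and the defaults (false and "vfdb") come from this PR.

```python
import json


def build_abricate_inputs(workflow_name, call_abricate=False, abricate_db="vfdb"):
    """Return the optional abricate inputs keyed as '<workflow>.<input>'.

    Hypothetical helper: the key layout follows the usual WDL inputs-JSON
    convention, and the workflow name is supplied by the caller.
    """
    inputs = {f"{workflow_name}.call_abricate": call_abricate}
    # abricate_db is only relevant when the optional module is switched on.
    if call_abricate:
        inputs[f"{workflow_name}.abricate_db"] = abricate_db
    return inputs


if __name__ == "__main__":
    # "theiaprok_illumina_pe" is an assumed workflow name for illustration.
    example = build_abricate_inputs("theiaprok_illumina_pe", call_abricate=True)
    print(json.dumps(example, indent=2))
```

Leaving both inputs unset keeps the module off, which is consistent with the statement below that existing behavior does not change for users who keep their current inputs.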

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version: No

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing: No

:clipboard: Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: No

Databases or database versions changed: No

Data processing/commands changed: Yes, new inputs and outputs

File processing changed: Yes, new inputs and outputs

Compute resources changed: No

➡️ Inputs

call_abricate
abricate_db

⬅️ Outputs

abricate_results_tsv
abricate_genes
abricate_database
abricate_version
abricate_docker
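
To make the relationship between abricate_results_tsv and abricate_genes concrete, the following Python sketch shows one way a comma-separated gene string could be derived from an abricate report. It assumes abricate's standard tab-separated output with a GENE column; the function name, the reduced column set, and the sample rows are invented for illustration, and this is not necessarily how the task in this PR computes the output.

```python
import csv
import io


def genes_from_abricate_tsv(tsv_text):
    """Collect the GENE column of an abricate report into one string."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Skip rows where the GENE field is missing or empty.
    genes = [row["GENE"] for row in reader if row.get("GENE")]
    return ",".join(genes)


# Illustrative report with a reduced set of columns and made-up rows.
SAMPLE_TSV = (
    "#FILE\tSEQUENCE\tGENE\t%COVERAGE\t%IDENTITY\tDATABASE\n"
    "sample.fasta\tcontig_1\tctxA\t100.00\t100.00\tvfdb\n"
    "sample.fasta\tcontig_1\tctxB\t100.00\t99.73\tvfdb\n"
)

if __name__ == "__main__":
    print(genes_from_abricate_tsv(SAMPLE_TSV))
```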

:test_tube: Testing

Test Dataset

Commandline Testing with MiniWDL or Cromwell (optional)

Terra Testing

C. diphtheriae samples, to check for toxin genes in vfdb with both PE and SE data:
C. diphtheriae Illumina PE: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/1a071907-3f92-429d-aebd-eed796b32d80
C. diphtheriae Illumina SE: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/171a8575-fc99-4051-9bcd-df4885cbeaa2

V. cholerae: to compare the genes identified by the optional module with vfdb against the srst2 genes and amrfinderplus_virulence_genes, such as ctxA
A. baumannii: to compare the genes identified by the optional module with vfdb against the default abricate_abaum genes
E. coli: to compare the genes identified by the optional module with vfdb against amrfinderplus_virulence_genes
V. cholerae, A. baumannii and E. coli Illumina PE: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/f1a612b8-2966-4dec-8ac6-7f00ff263050
V. cholerae ONT: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/29fc26b8-6605-43b9-8680-6d9ef1a6f784

Suggested Scenarios for Reviewer to Test

Additional pathogens with alternative abricate databases to vfdb

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

🗂️ Associated Documentation (to be completed by Theiagen developer)

michellescribner commented 5 months ago

Add outputs to taxon tables

jrotieno commented 5 months ago

Updated the code so that, for A. baumannii, the abricate database, version, and docker outputs are kept separate from the corresponding outputs of the optional module. The rationale is that, now or in the future, the database a user wants to run through the optional module might differ from the one run by merlin magic, even for the same pathogen. Separate outputs also make it easier to test that the optional module works correctly when it uses the same database as merlin magic. If this is unnecessary, happy to revert.
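
A quick way to reason about this separation is to check that the two sets of output names never overlap. The sketch below is only an illustration: the optional-module names come from the Outputs list in the PR description, while the taxon-specific names and the helper conflicting_outputs are hypothetical placeholders, not the names used in the workflow.

```python
def conflicting_outputs(optional_outputs, taxon_outputs):
    """Return output names that appear in both collections, sorted."""
    return sorted(set(optional_outputs) & set(taxon_outputs))


# Output names of the optional module, as listed in this PR.
OPTIONAL_OUTPUTS = [
    "abricate_results_tsv",
    "abricate_genes",
    "abricate_database",
    "abricate_version",
    "abricate_docker",
]

# Hypothetical taxon-specific names, used here only for illustration.
TAXON_OUTPUTS = [
    "abricate_abaum_database",
    "abricate_abaum_version",
    "abricate_abaum_docker",
]

if __name__ == "__main__":
    clashes = conflicting_outputs(OPTIONAL_OUTPUTS, TAXON_OUTPUTS)
    print("conflicts:", clashes if clashes else "none")
```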

cimendes commented 4 months ago

Retesting call_abricate true (default vfdb database) on A. baumannii genomes with TheiaProk_FASTA to verify that output names are not conflicting: