theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

[TheiaProk] Add abricate module for vibrio characterization #429

Closed cimendes closed 3 months ago

cimendes commented 5 months ago

This PR closes #395

🗑️ This dev branch should NOT be deleted after merging to main.

:brain: Aim, Context and Functionality

This PR adds a new Vibrio-specific abricate task for genomic characterization. It relies on a species-specific database that is packaged in the us-docker.pkg.dev/general-theiagen/internal/abricate:1.0.1-vibrio-cholera container. The PR for this container, including the database, is available at https://github.com/StaPH-B/docker-builds/pull/963

In short, if gambit identifies the species as Vibrio or Vibrio cholerae, the abricate_vibrio task runs on the assembly. The results are written to the sample-level data table with the abricate_vibrio prefix. They are also written to the taxon table if that functionality is activated.
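As a rough illustration of the gating described above, the check could look like the Python sketch below. The workflow itself is written in WDL, so this is not the repository's code: the function name and the exact-match comparison on the taxon string are assumptions made for illustration.

```python
# Assumption: gating is an exact match on the predicted taxon string.
VIBRIO_TAXA = ("Vibrio", "Vibrio cholerae")


def should_run_abricate_vibrio(predicted_taxon: str) -> bool:
    """Return True when the predicted taxon should trigger abricate_vibrio."""
    return predicted_taxon.strip() in VIBRIO_TAXA


if __name__ == "__main__":
    for taxon in ("Vibrio cholerae", "Vibrio", "Escherichia coli"):
        print(taxon, "->", should_run_abricate_vibrio(taxon))
```

Whether other Vibrio species should also trigger the task depends on how the taxon is reported and is not settled by this sketch.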

Within the abricate_vibrio task, the abricate output file is parsed to determine the following:

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes if analysing vibrio data

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

:clipboard: Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: Nothing on pre-existing components of TheiaProk. A new module has been added specific for Vibrio

Databases or database versions changed: Nothing on pre-existing components of TheiaProk. A new module has been added specific for Vibrio

Data processing/commands changed: Nothing on pre-existing components of TheiaProk. A new module has been added specific for Vibrio

File processing changed: Nothing on pre-existing components of TheiaProk. A new module has been added specific for Vibrio

Compute resources changed: Nothing on pre-existing components of TheiaProk. A new module has been added specific for Vibrio

➡️ Inputs

Exposed through merlin_magic:

abricate_vibrio_mincov
abricate_vibrio_minid

⬅️ Outputs

abricate_vibrio_detailed_tsv
abricate_vibrio_database
abricate_vibrio_version
abricate_vibrio_ctxA
abricate_vibrio_ompW
abricate_vibrio_toxR
abricate_vibrio_biotype
abricate_vibrio_serogroup
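
To make the relationship between the abricate report and the output columns above more concrete, here is a minimal Python sketch of the parsing step. It is not the task's actual implementation. It assumes the tab-separated abricate report has a GENE column, that the biotype markers are named tcpA_ElTor and tcpA_classical, and that serogroup is inferred from hypothetical marker names (wbeN for O1, wbfR for O139). The names used in the packaged database may differ.

```python
import csv
import io

NOT_DETECTED = "(not detected)"

# Marker names are assumptions for illustration; the packaged database may
# label its sequences differently.
BIOTYPE_MARKERS = ("tcpA_ElTor", "tcpA_classical")
SEROGROUP_MARKERS = {"wbeN": "O1", "wbfR": "O139"}  # hypothetical mapping


def summarize_abricate(tsv_text: str) -> dict:
    """Collapse the hits in an abricate TSV report into one value per column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    genes = {row["GENE"] for row in reader}  # only the GENE column is read

    def presence(gene: str) -> str:
        return "present" if gene in genes else NOT_DETECTED

    biotypes = [m for m in BIOTYPE_MARKERS if m in genes]
    serogroups = [label for gene, label in SEROGROUP_MARKERS.items() if gene in genes]
    return {
        "abricate_vibrio_ctxA": presence("ctxA"),
        "abricate_vibrio_ompW": presence("ompW"),
        "abricate_vibrio_toxR": presence("toxR"),
        "abricate_vibrio_biotype": ",".join(biotypes) or NOT_DETECTED,
        "abricate_vibrio_serogroup": ",".join(serogroups) or NOT_DETECTED,
    }
```

Hits are assumed to have already passed the minid and mincov filters, since abricate applies those before writing its report.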

:test_tube: Testing

Test Dataset

Commandline Testing with MiniWDL or Cromwell (optional)

Terra Testing

Illumina dataset: from https://journals.asm.org/doi/full/10.1128/jcm.00831-18 The study characterized Vibrio cholerae strains isolated between April 2004 and March 2018 and held in the Public Health England culture archive. The publication reports traditional biochemical species identification and serological typing results, plus genome-derived species identification and serotyping for a subset of the isolates. The data include samples of different biotypes and serogroups, covering both V. cholerae and non-cholerae Vibrio species. True Positive Rate (TPR) of 0.9-1.0. Terra: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/df0be5c2-234c-420e-a8d0-f680ebd20779

ONT dataset: from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10996759/ All samples were confirmed by conventional PCR to be toxigenic V. cholerae of serogroup O1. Terra: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/d151f49b-b137-4780-9c5c-d6afea8628e4

Suggested Scenarios for Reviewer to Test

Additional ONT testing with the O139 serogroup and the tcpA_Classical biotype would be valuable. It would also be a good idea to take some of the assemblies from the Illumina and/or ONT datasets above and run them through TheiaProk_FASTA.

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

🗂️ Associated Documentation (to be completed by Theiagen developer)


jrotieno commented 5 months ago

@cimendes I suggest we change abricate_vibrio_abricate_tsv to abricate_vibrio_detailed_tsv for coherence with SRST2's srst2_vibrio_detailed_tsv

emmadoughty commented 4 months ago

Ran TheiaProk_Illumina_PE, ONT and FASTA; all workflows ran as anticipated. Comments:

cimendes commented 4 months ago

Ran TheiaProk_Illumina_PE, ONT and FASTA; all workflows ran as anticipated. Comments:

  • The abricate database is simply set to vibrio. It would be nice to add a version for this, in case we add to or change the db in the future
  • There is some discordance between abricate and srst2 results from Illumina data
  • There is a discrepancy in the coverage and ID thresholds used by the two vibrio characterization modules (see below). Have abricate thresholds of cov = 80 and id = 80 been tested (as consistent with srst2)?
    Int srst2_min_cov = 80
    Int srst2_max_divergence = 20
    Int abricate_vibrio_minid = 70
    Int abricate_vibrio_mincov = 60

Thank you so much for looking over this PR! Indeed, as far as I'm aware, the abricate module has not been tested with anything other than the current default values (minid of 70 and mincov of 60); tagging @jrotieno, as he did most of the testing. These values were taken from the abricate task for A. baumannii and I didn't give it much thought. By default, abricate uses 80 for both. Would it be too laborious to set these values to the abricate defaults and then retest?
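
To illustrate why the choice between 70/60 and 80/80 matters, here is a small Python sketch of threshold filtering. The hit values are made up for illustration and do not come from any sample in this PR. The sketch assumes a hit is kept when both its percent identity and percent coverage are at or above the thresholds.

```python
def passes_thresholds(pct_identity: float, pct_coverage: float,
                      minid: float, mincov: float) -> bool:
    """Keep a hit only if identity and coverage both meet the minimums."""
    return pct_identity >= minid and pct_coverage >= mincov


# Hypothetical hit: a divergent allele covering most of the reference gene.
hit = {"gene": "tcpA_ElTor", "pct_identity": 76.4, "pct_coverage": 91.0}

for minid, mincov in ((70, 60), (80, 80)):
    kept = passes_thresholds(hit["pct_identity"], hit["pct_coverage"], minid, mincov)
    print(f"minid={minid} mincov={mincov}: {'kept' if kept else 'dropped'}")
```

A hit like this one is reported under 70/60 but dropped under 80/80, which would turn a "present" call into "(not detected)".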

jrotieno commented 3 months ago

Hi @emmadoughty, the following changes have been made: abricate_vibrio - minid and mincov set to the default 80 for both

Tests done here: Illumina_PE: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/2c9e23ef-2a0b-4bdf-9eca-78d520c94722 For two samples, changing the two default values appears to have affected biotype detection.

sample      abricate_vibrio_biotype  abricate_vibrio_biotype_old
SRR7062511  (not detected)           tcpA_ElTor
SRR7062612  (not detected)           tcpA_classical

Re-ran with the old defaults to check whether we get the old values: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/2fd82255-536a-4871-93fd-e33149e985b9, and indeed we get the old results.

Also, these samples had different ompW results:

sample      abricate_vibrio_ompW  abricate_vibrio_ompW_old
SRR7062587  (not detected)        present
SRR7637792  (not detected)        present
SRR7637793  (not detected)        present
SRR7637794  (not detected)        present
SRR7637796  (not detected)        present
SRR7637798  (not detected)        present
SRR7637799  (not detected)        present

Different results with toxR

sample      abricate_vibrio_toxR  abricate_vibrio_toxR_old
SRR7062587  (not detected)        present
SRR7637794  (not detected)        present
SRR7062522  (not detected)        present
SRR7062523  (not detected)        present
SRR7062525  (not detected)        present
SRR7062551  (not detected)        present
SRR7637797  (not detected)        present

As expected, there were no differences between the previous and current srst2 runs.

Differences between abricate_vibrio and srst2:

sample      abricate_vibrio_ctxA  srst2_vibrio_ctxA
SRR7062576  (not detected)        present
SRR7062601  (not detected)        present (low depth/uncertain)

None of the samples below had different results when the thresholds were changed above:

sample      abricate_vibrio_ompW  srst2_vibrio_ompW
SRR7637797  present               (not detected)
SRR7062519  present               (not detected)
SRR7062552  present               (not detected)
SRR7062592  present               (not detected)
SRR7062631  (not detected)        present

Note that samples SRR7062522, SRR7062523, SRR7062525, and SRR7062551 below similarly had different results when thresholds were changed above

sample      abricate_vibrio_toxR  srst2_vibrio_toxR
SRR7062522  (not detected)        present (low depth/uncertain)
SRR7062523  (not detected)        present
SRR7062525  (not detected)        present
SRR7062551  (not detected)        present
SRR7062539  (not detected)        present

ONT: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/5e179757-9fe5-44bc-8b1a-5b290a415fb2 No differences observed when thresholds were changed.
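
For anyone wanting to reproduce comparisons like the ones above from exported data tables, a minimal Python sketch follows. It assumes each result column has been loaded into a dict keyed by sample ID. The function name is hypothetical, and EXAMPLE_CONCORDANT is a made-up sample included only to show that matching calls are not reported.

```python
def discordant_samples(results_a: dict, results_b: dict) -> list:
    """List (sample, value_a, value_b) for shared samples whose calls differ."""
    shared = sorted(results_a.keys() & results_b.keys())
    return [(s, results_a[s], results_b[s])
            for s in shared
            if results_a[s] != results_b[s]]


abricate_ctxA = {
    "SRR7062576": "(not detected)",
    "EXAMPLE_CONCORDANT": "present",  # hypothetical sample, not from the dataset
}
srst2_ctxA = {
    "SRR7062576": "present",
    "EXAMPLE_CONCORDANT": "present",
}

for sample, a, b in discordant_samples(abricate_ctxA, srst2_ctxA):
    print(sample, a, b, sep="\t")
```

The comparison is a plain string match, so a qualified call such as "present (low depth/uncertain)" counts as different from "present" unless values are normalized first.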

emmadoughty commented 3 months ago

Thanks, James. This is extremely helpful!

Note, samples SRR7062511 and SRR7062612 have the same results with both abricate implementations, but different results with SRST2.

Looking at the new results, these are more concordant with the SRST2 results (with the exception of results for toxR detection).

Thresholds for min ID and min coverage may need to be optimized, but this will require a gold standard. Gold standard results could be taken from the PHE paper or from partner labs.
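
Once a gold standard is available, scoring a given threshold setting could be as simple as the Python sketch below. It is only an outline under stated assumptions: calls are reduced to booleans (marker present or not) keyed by sample ID, a sample missing from the predictions counts as not detected, and the function name is hypothetical.

```python
def true_positive_rate(predicted: dict, gold: dict) -> float:
    """TPR = TP / (TP + FN), computed over gold-standard positive samples."""
    positives = [s for s, truth in gold.items() if truth]
    if not positives:
        raise ValueError("gold standard contains no positive samples")
    true_pos = sum(1 for s in positives if predicted.get(s, False))
    return true_pos / len(positives)
```

Running this once per candidate minid/mincov pair, alongside a matching false positive count, would give a basis for choosing between settings such as 70/60 and 80/80.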