theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

V. cholerae O1/O139 mlst scheme selection bugfix #333

Closed kapsakcj closed 7 months ago

kapsakcj commented 8 months ago

This PR closes #58

πŸ—‘οΈ This dev branch should be deleted after merging to main.

:brain: Aim, Context and Functionality

Vibrio cholerae has two MLST schemes: one for the O1 and O139 serogroups, and one for non-O1/non-O139 serogroups. The latter is currently always used regardless of serogroup, because mlst's automatic scheme detection seems to always choose the non-O1/non-O139 scheme. mlst also excludes a few schemes by default, and vcholerae_2 is one of them, which is perhaps why vcholerae is always chosen.

The mlst command-line tool names these schemes vcholerae (non-O1/non-O139) and vcholerae_2 (O1 and O139).

Because mlst's auto-scheme detection chooses the non-O1/non-O139 scheme by default, this PR automatically forces mlst to use vcholerae_2 for samples detected as O1 or O139 by the SRST2 task.
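The selection logic described above can be sketched roughly as follows. This is a minimal illustration in Python, not the WDL used in the workflows: the helper names `scheme_for_serogroup` and `build_mlst_command` are hypothetical, and the serogroup strings `"O1"` and `"O139"` are an assumption about what the SRST2 task reports. The only mlst option used is `--scheme`, which forces a scheme instead of relying on auto-detection.

```python
from typing import List, Optional

# Scheme name used by the mlst command-line tool for O1/O139 (from the PR text).
O1_O139_SCHEME = "vcholerae_2"


def scheme_for_serogroup(serogroup: Optional[str]) -> Optional[str]:
    """Return the scheme to force, or None to leave mlst auto-detection on.

    Hypothetical helper: the serogroup labels are assumed, not taken
    from the actual SRST2 task output.
    """
    normalized = (serogroup or "").strip().upper()
    if normalized in {"O1", "O139"}:
        return O1_O139_SCHEME
    return None


def build_mlst_command(assembly: str, scheme: Optional[str] = None) -> List[str]:
    """Build an mlst invocation; --scheme is only added when a scheme is forced."""
    command = ["mlst"]
    if scheme is not None:
        command += ["--scheme", scheme]
    command.append(assembly)
    return command


if __name__ == "__main__":
    for serogroup in ("O1", "O139", "non-O1/non-O139", None):
        scheme = scheme_for_serogroup(serogroup)
        print(serogroup, build_mlst_command("sample.fasta", scheme))
```

Returning `None` for every other serogroup keeps the existing auto-detection behaviour untouched, so only O1 and O139 samples would see different results.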

More info on the 2 mlst schemes can be found here: https://pubmlst.org/organisms/vibrio-cholerae

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don't change any workflow inputs relative to the last version: No, except when analyzing V. cholerae samples of the O1 or O139 serogroups. The mlst results for those samples will differ from results produced with a previous version of the TheiaProk workflow.

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

:clipboard: Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: No

Databases or database versions changed: No

Data processing/commands changed: No

File processing changed: No

Compute resources changed: No

➡️ Inputs

⬅️ Outputs

:test_tube: Testing

Test Dataset

Commandline Testing with MiniWDL or Cromwell (optional)

Terra Testing

Suggested Scenarios for Reviewer to Test

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

πŸ—‚οΈ Associated Documentation (to be completed by Theiagen developer)

kapsakcj commented 8 months ago

TODO: need to update the TheiaProk ILMN SE, ONT, and FASTA workflows to get the mlst scheme input from the merlin_magic subworkflow.

Testing that this works for ILMN PE in Terra now

kapsakcj commented 7 months ago

Tests launched & comments regarding them:

TODO:

Functional test results:

kapsakcj commented 7 months ago

One last test to show that user-defined input takes priority in the select_first statement in the merlin_magic subworkflow.

I used "senterica" as the mlst_scheme input for a few E. coli samples, and the mlst task used senterica as the scheme:

https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/ddd76e8c-fc7e-467d-a3e5-307cec43020c
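For readers less familiar with WDL, the priority behaviour tested here can be approximated in Python. This is only a sketch of the idea: `select_first` below mimics the WDL built-in (first defined value wins, error if none is defined), while `resolve_mlst_scheme` and its parameter names are hypothetical and not taken from the merlin_magic code.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def select_first(values: Sequence[Optional[T]]) -> T:
    """Python stand-in for WDL's select_first: return the first defined value."""
    for value in values:
        if value is not None:
            return value
    raise ValueError("select_first: no defined value in input")


def resolve_mlst_scheme(
    user_scheme: Optional[str], detected_scheme: Optional[str]
) -> Optional[str]:
    """Hypothetical helper: a user-supplied scheme beats an auto-selected one.

    Returns None when neither is set, leaving mlst to auto-detect.
    """
    candidates = [user_scheme, detected_scheme]
    if all(candidate is None for candidate in candidates):
        return None
    # Order matters: the user-defined input is listed first, so it wins.
    return select_first(candidates)


if __name__ == "__main__":
    print(resolve_mlst_scheme("senterica", "vcholerae_2"))
    print(resolve_mlst_scheme(None, "vcholerae_2"))
    print(resolve_mlst_scheme(None, None))
```

Listing the user-defined value first is what gives it priority, which matches the behaviour seen in the test above where "senterica" was used even for E. coli samples.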

emmadoughty commented 7 months ago

I have looked deeper into the literature given Curtis' observation that no ST was assigned to the O1 and O139 samples with the vcholerae_2 scheme.


Though PubMLST lists the vcholerae_2 scheme as the O1 and O139 scheme, it may be more useful for the public health community to use the vcholerae scheme, as they are more likely to identify a named ST for their V. cholerae isolates, including O1 and O139 isolates. Using this scheme doesn't seem to be wrong; in fact, there seems to be a consensus to use it preferentially.

cimendes commented 7 months ago

Thank you @emmadoughty for the detective work 🕵🏻‍♀️

I agree with your assessment that the community is using vcholerae scheme for all isolates, regardless of serotype. I think we should close off this PR and not incorporate these changes into the main codebase to keep in line with the community standards and expectations.