theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

Expose midas secondary genus absolute coverage #257

Closed: michellescribner closed this 7 months ago

michellescribner commented 10 months ago

Closes https://github.com/theiagen/public_health_bioinformatics/issues/247

:hammer_and_wrench: Changes Being Made

Impacted Workflows/Tasks

./tasks/taxon_id/task_midas.wdl
./tasks/quality_control/task_qc_check.wdl
./tasks/utilities/task_broad_terra_tools.wdl
./workflows/theiaprok/wf_theiaprok_illumina_se.wdl
./workflows/theiaprok/wf_theiaprok_illumina_pe.wdl
./workflows/utilities/wf_read_QC_trim_se.wdl
./workflows/utilities/wf_read_QC_trim_pe.wdl

:brain: Context and Rationale

This PR exposes the absolute coverage of the second-most prevalent genus detected by MIDAS in a sample. It also makes MIDAS a default module so that contamination is more easily identifiable.

:clipboard: Workflow/Task Steps

The MIDAS task performs the following steps:

  1. orders the MIDAS report by the relative abundance column
  2. identifies the most prevalent genus by relative abundance
  3. identifies the second-most prevalent genus by relative abundance
  4. flags if the relative abundance of the second-most prevalent genus is above 0.01 (1%)

In addition to these steps, the task will now also:

  1. report the absolute coverage of the second-most prevalent genus, as shown in the "coverage" column below.

species_id                 count_reads  coverage     relative_abundance
Salmonella_enterica_58156  3309         89.88006645  0.855888033
Salmonella_enterica_58266  501          11.60606061  0.110519371
Salmonella_enterica_53987  99           2.232896237  0.021262881
Citrobacter_youngae_61659  46           0.995216227  0.009477003
Escherichia_coli_58110     5            0.123668877  0.001177644
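
For reviewers who want to sanity-check the expected output by hand, the steps above can be sketched in Python roughly as follows. This is an illustration only, not the code in task_midas.wdl: it assumes the genus is the text before the first underscore in species_id, and that the secondary genus is the first row (after sorting) whose genus differs from the top one. The function names and the returned dictionary keys are hypothetical.

```python
# Sample rows copied from the table above (whitespace-separated for this sketch).
REPORT = """\
species_id count_reads coverage relative_abundance
Salmonella_enterica_58156 3309 89.88006645 0.855888033
Salmonella_enterica_58266 501 11.60606061 0.110519371
Salmonella_enterica_53987 99 2.232896237 0.021262881
Citrobacter_youngae_61659 46 0.995216227 0.009477003
Escherichia_coli_58110 5 0.123668877 0.001177644
"""

SECONDARY_THRESHOLD = 0.01  # 1% relative abundance, as in step 4


def genus_of(species_id):
    # Assumption: the genus is everything before the first underscore.
    return species_id.split("_", 1)[0]


def summarize(report_text):
    lines = report_text.strip().splitlines()
    header = lines[0].split()
    rows = [dict(zip(header, line.split())) for line in lines[1:]]
    if not rows:
        raise ValueError("report contains no species rows")

    # Step 1: order the report by relative abundance, highest first.
    rows.sort(key=lambda r: float(r["relative_abundance"]), reverse=True)

    # Step 2: the most prevalent genus is the genus of the top row.
    primary = genus_of(rows[0]["species_id"])

    # Step 3: the secondary genus is the first row with a different genus.
    secondary = next(
        (r for r in rows if genus_of(r["species_id"]) != primary), None
    )
    if secondary is None:
        return {
            "primary_genus": primary,
            "secondary_genus": None,
            "secondary_genus_abundance": None,
            "secondary_genus_coverage": None,
            "secondary_genus_flagged": False,
        }

    abundance = float(secondary["relative_abundance"])
    return {
        "primary_genus": primary,
        "secondary_genus": genus_of(secondary["species_id"]),
        "secondary_genus_abundance": abundance,
        # New in this PR: the absolute coverage of the secondary genus.
        "secondary_genus_coverage": float(secondary["coverage"]),
        # Step 4: flag when the secondary genus exceeds the threshold.
        "secondary_genus_flagged": abundance > SECONDARY_THRESHOLD,
    }


if __name__ == "__main__":
    print(summarize(REPORT))
```

Under these assumptions, the sample report gives Salmonella as the primary genus and Citrobacter as the secondary genus, with a coverage of 0.995216227 and a relative abundance just under the 1% threshold, so the sample would not be flagged.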

This PR also makes the MIDAS module default in TheiaProk_Illumina_PE_PHB and TheiaProk_Illumina_SE_PHB. However, MIDAS can be turned off by setting call_midas to false to conserve compute resources and time.

Finally, this PR adds the string input workflow_series to the TheiaProk PE and SE workflows with the default value of "theiaprok" for consistency with the TheiaProk ONT workflow. This input is passed to read_qc_trim_pe and read_qc_trim_se, respectively, to ensure that the MIDAS task is only run for the TheiaProk workflows.
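
The combined effect of call_midas and workflow_series can be pictured with the small Python sketch below. It is only a reading aid under the assumption that both conditions must hold for MIDAS to run. The real gating lives in the WDL workflows, and the function name should_run_midas is hypothetical.

```python
def should_run_midas(call_midas: bool, workflow_series: str) -> bool:
    # Assumption: MIDAS runs only when it has not been switched off
    # and the calling workflow identifies itself as "theiaprok".
    return call_midas and workflow_series == "theiaprok"


if __name__ == "__main__":
    print(should_run_midas(True, "theiaprok"))      # defaults described above
    print(should_run_midas(False, "theiaprok"))     # user opted out
    print(should_run_midas(True, "other_series"))   # placeholder series name
```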

Inputs

No workflow inputs

Outputs

midas_secondary_genus_coverage

Impacted Outputs

:test_tube: Testing

Locally

No local testing was performed

Terra

TheiaProk_Illumina_PE_PHB testing on 55 enteric samples: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/27465535-a7d4-42a4-a979-c880ea737019

Scenarios for Reviewer to Test

TheiaProk_Illumina_SE_PHB still needs testing

:microscope: Quality checks

Pull Request (PR) checklist:

sage-wright commented 7 months ago

Currently testing PE here and SE here.

Will approve upon success.