theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

Percent Gene Coverage Task Modularization #341

Closed sage-wright closed 3 months ago

sage-wright commented 4 months ago

This PR closes #325 and closes #312

🗑️ This dev branch should be deleted after merging to main.

~Currently waiting on the organism parameter PR merging before adding this since it will use that workflow.~

:brain: Aim, Context and Functionality

As TheiaCoV expands to include more organisms, a WDL task that is hard-coded for a single organism is inefficient when we want the same behavior for other organisms. This PR changes the breadth-of-coverage calculation so that it is no longer hard-coded and is organism-agnostic. It relies on the organism_parameter logic subworkflow and also lets the user specify which regions are listed in the output file by overriding the default.

Default bed files are currently only provided for mpox and SC2.

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version: Yes

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing: No

:clipboard: Workflow/Task Step Changes

🔄 Data Processing

The same calculations are used, but they now require a BED file as input to determine the regions of interest. The task loops through the regions in this BED file and uses samtools depth to determine the percentage of sites in each region that are above the specified minimum depth.
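As a rough illustration of the calculation described above, here is a minimal Python sketch; it is not the task's actual implementation, which runs samtools depth inside a WDL command block. The function names are hypothetical. The sketch assumes the depth input is the three-column, tab-separated text that samtools depth prints (reference name, 1-based position, depth), that BED coordinates are 0-based and half-open, that positions missing from the depth output count as zero depth, and that "above the minimum depth" means at or above it.

```python
from typing import Dict, Iterable, List, Tuple

# (chrom, start, end, name): BED start is 0-based, end is exclusive.
Region = Tuple[str, int, int, str]


def parse_bed(lines: Iterable[str]) -> List[Region]:
    """Parse BED lines into regions; a missing name column falls back to coordinates."""
    regions: List[Region] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        regions.append((chrom, start, end, name))
    return regions


def parse_depth(lines: Iterable[str]) -> Dict[Tuple[str, int], int]:
    """Parse `samtools depth`-style text (chrom, 1-based position, depth)."""
    depth: Dict[Tuple[str, int], int] = {}
    for line in lines:
        if not line.strip():
            continue
        chrom, pos, value = line.rstrip("\n").split("\t")[:3]
        depth[(chrom, int(pos))] = int(value)
    return depth


def percent_gene_coverage(
    regions: List[Region],
    depth: Dict[Tuple[str, int], int],
    min_depth: int,
) -> Dict[str, float]:
    """Percent of positions per region with depth >= min_depth (assumed threshold rule)."""
    result: Dict[str, float] = {}
    for chrom, start, end, name in regions:
        length = end - start
        if length <= 0:
            result[name] = 0.0
            continue
        # Convert the 0-based half-open BED interval to 1-based positions.
        covered = sum(
            1
            for pos in range(start + 1, end + 1)
            if depth.get((chrom, pos), 0) >= min_depth
        )
        result[name] = round(100 * covered / length, 2)
    return result


# Toy data: two 4-bp "genes" on a reference named "ref".
bed_text = ["ref\t0\t4\tgeneA", "ref\t4\t8\tgeneB"]
depth_text = [f"ref\t{i + 1}\t{d}" for i, d in enumerate([10, 10, 5, 0, 20, 20, 20, 20])]
coverage = percent_gene_coverage(parse_bed(bed_text), parse_depth(depth_text), min_depth=10)
print(coverage)
```

With the toy data, geneA has two of four positions at or above the threshold and geneB has all four, so the sketch reports 50.0 and 100.0 percent respectively.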

Docker/software or software versions changed:

Databases or database versions changed:

Data processing/commands changed:

File processing changed:

Compute resources changed:

➡️ Inputs

New input: reference_gene_locations_bed, a BED file of gene locations. The locations must correspond to the same reference file that was used for alignment. By default, this file is provided for SC2 and mpox; the user can supply their own file through this input to override the defaults.
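The override behavior can be pictured with a small Python sketch. This is only an illustration of the "user file wins, otherwise fall back to the organism default" logic; the real selection happens in WDL. The organism keys, file paths, and function name below are hypothetical and do not come from the repository.

```python
from typing import Dict, Optional

# Hypothetical default BED paths keyed by hypothetical organism labels.
DEFAULT_GENE_LOCATIONS_BED: Dict[str, str] = {
    "sars-cov-2": "defaults/sc2_gene_locations.bed",
    "mpox": "defaults/mpox_gene_locations.bed",
}


def resolve_gene_locations_bed(organism: str, user_bed: Optional[str] = None) -> Optional[str]:
    """Return the user-supplied BED if given, else the organism default, else None."""
    if user_bed:
        return user_bed
    return DEFAULT_GENE_LOCATIONS_BED.get(organism.lower())
```

In this sketch, an organism without a default and without a user-supplied file resolves to None, which a caller could use to skip the coverage step.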

⬅️ Outputs

The sc2_all_genes_percent_coverage output is now named est_percent_gene_coverage_tsv, as it is no longer SC2-specific.

:test_tube: Testing

Test Dataset

Commandline Testing with MiniWDL or Cromwell (optional)

Terra Testing

Suggested Scenarios for Reviewer to Test

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

🗂️ Associated Documentation (to be completed by Theiagen developer)

cimendes commented 3 months ago

🥇 for adding tests and tweaks to better adhere to the style-guide!

cimendes commented 3 months ago

Testing MPOX and SC2 here:

cimendes commented 3 months ago

The two failures in ONT data are unrelated to the changes in this PR (and should be removed from the validation dataset!): https://app.terra.bio/#workspaces/cdc-terra-resources/Theiagen_Wright_SC2_Sandbox/job_history/22c272cd-9d35-4d00-bf2d-b98f5928f882

cimendes commented 3 months ago

⚠️ https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/210b6ce7-c37a-4435-a15f-8dc7c5da5d8e

The task runs successfully, but the depth coverage is not being captured correctly: image

https://job-manager.dsde-prod.broadinstitute.org/jobs/9fc8388d-4c18-4147-b5e2-55183de96ee3

cimendes commented 3 months ago

image

sage-wright commented 3 months ago

@cimendes Issue resolved!

cimendes commented 3 months ago

All fixed! :D image

Thank you @sage-wright!