theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0
33 stars 15 forks source link

IRMA bug fixes & improvements; theiacov_illumina_pe wf updates for Flu #468

Closed kapsakcj closed 2 weeks ago

kapsakcj commented 1 month ago

Starting a draft while doing testing in Terra. Will update this message periodically

This PR closes #412 closes #437 and closes #457

🗑️ This dev branch should be deleted after merging to main.

:brain: Aim, Context and Functionality

The aim of this PR is to resolve a few different issues/bugs and make general improvements & upgrades to the TheiaCoV workflows for Flu analysis. Much of these changes impact the IRMA task, used in TheiaCoV_Illumina_PE wf, but some changes impact other workflows like TheiaCoV_ONT for Flu analysis.

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

:clipboard: Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: cdcgov/irma:v1.1.3 ➡️ "us-docker.pkg.dev/general-theiagen/cdcgov/irma:v1.1.5"

Databases or database versions changed: N/A

Data processing/commands changed: lots of changes to IRMA task & data processing

File processing changed: lots of changes to output files & file renaming & FASTA header replacement

Compute resources changed:

➡️ Inputs

⬅️ Outputs

New IRMA task outputs:

New TheiaCoV_Illumina_PE & ONT outputs:

⚠️ NOTE: I have opted to NOT output the padded FASTA files to the workflow level (i.e. Terra output column) as it would add lots of clutter and likely confuse the user. They are saved as intermediate files during the IRMA task and can be retrieved from the execution directory if necessary.

:test_tube: Testing

Test Dataset

Described below.

Commandline Testing with MiniWDL or Cromwell (optional)

tested lots with miniwdl locally, but don't have output saved.

Terra Testing

Suggested Scenarios for Reviewer to Test

Would be good to test as many types/subtypes as possible and even test with poor quality data to see how the workflow behaves when IRMA cannot produce an assembly.

Need to test ONT as I've primarily been testing the IRMA WDL task updates with Illumina data. I tested 5 ONT samples, but would be good to do more if data is available

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

🗂️ Associated Documentation (to be completed by Theiagen developer)

kapsakcj commented 2 weeks ago

OK PR is ready for review.

I'll update the documentation and check the above box when I'm finished.

@sage-wright FYI there was one additional commit made after you forked your branch, you may want to merge that in if you are working on theiacov_ont workflow

kapsakcj commented 2 weeks ago

docs have been updated 👍 (mostly updates to theiacov inputs and outputs table)

sage-wright commented 2 weeks ago

TheiaProk_Illumina_PE here success TheiaProk_ONT here success

kapsakcj commented 2 weeks ago

I relaunched Sage's tests since they were run prior to the last 2 commits. Same samples, call caching off, both run on the cjk-irma-diskcheck branch

⚠️ Need to review outputs, esp. the sample that previously had a - in the output HA segment FASTA file and all segment FASTA file

cimendes commented 2 weeks ago

Code changes look solid! 🏅

New outputs present as expected: image

Launching a new set of tests as I don't have access to Sage's workspace:

cimendes commented 2 weeks ago

My tests were successful and the workflow is working as expected. @jrotieno you're the main reviewer here but you got my okay!