theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0
36 stars 17 forks source link

output unaligned FASTQ files TheiaCov_Illumina PE and SE #275

Closed kapsakcj closed 6 months ago

kapsakcj commented 8 months ago

Closes #242

This dev branch should NOT be deleted after merging, we need to notify Alex E. and recommend switching to main branch for further usage before deleting it

:hammer_and_wrench: Changes Being Made

changes to tasks/alignment/task_bwa.wdl

Changes to workflows/theiacov/wf_theiacov_illumina_pe.wdl and workflows/theiacov/wf_theiacov_illumina_se.wdl and workflows/utilities/wf_ivar_consensus.wdl

Impacted Workflows/Tasks

:brain: Context and Rationale

A user requested that unaligned FASTQ files (i.e. the reads that do not align to the target reference genome, example: Mpox) are output from the TheiaCov_Illumina workflows. The goal is to take these reads that did not align to the reference and perform downstream analysis on them (like Kraken for taxonomic assignment, or other purposes like attempting to assemble a genome out of those reads not aligned)

In the process of addressing this request, I also added the unaligned_bam and it's index which is a BAM file that only contains reads that did not align to the reference (see lines 59-64 in BWA WDL task to see where this BAM is created). Not sure how this would be used, but it doesn't hurt to output this as well.

:clipboard: Workflow/Task Steps

Inputs

added 2 new optional inputs to bwa task: memory and docker

Outputs

Impacted Outputs

Pre-existing outputs that may be impacted (this excludes the new ones)

:test_tube: Testing

Data

Test data used were as follows

  1. Simulated SARS-CoV-2 data. This was expected to map 100% to the reference SARS-CoV-2 genome, i.e. 0 unaligned reads.
  2. Human data. This was expected not to align, but some reads aligned to the the reference SARS-CoV-2 genome.
  3. A spiked dataset from 1. and 2. above. SARS-CoV-2 and some human reads expected to map.
  4. Unaligned reads from 2. This was to test cases when nothing aligns to a reference, in which case unaligned reads file will be created but empty.

Locally

Worked as expected

Terra

https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/Global_tree_testing/job_history/43118990-5338-4d92-b429-19158452cf78

Scenarios for Reviewer to Test

:microscope: Quality checks

Pull Request (PR) checklist:

kapsakcj commented 7 months ago

NOTE: this change will impact Freyja_FASTQ workflow as it used the same bwa task

So, we are planning to skip adding these new outputs to Freyja_FASTQ workflow, but it would be good to do a functional validation test of the workflow to ensure it still runs as expected

I am worried that this may change the results/outputs of Freyja_FASTQ too, so it would be good to actually compare results instead of just a functional workflow test

kapsakcj commented 7 months ago

I launched a series of tests on some 2 sars-cov-2 Illumina samples. hoping to compare outputs between the 2 (version1.3.0 vs this dev branch.

and single end:

TODO:

The BAMs are identical in size, as expected

The aligned FASTQs (like the R1 file) for v1.3.0 are slightly larger than those on the dev branch. I believe this is because for the dev branch, the samtools fastq command for outputting aligned reads also requires that reads have a mate/pair where as v1.3.0 did not. So this is to be expected, but is a minor difference from v1.3.0 behavior

All of these QC metrics were identical. Meaning that the sorted BAM used for variant calling/consensus FASTA generation are identical to v1.3.0 and downstream analysis is not impacted by these changes. Woo! 🎉

cimendes commented 7 months ago

Workflows using bwa:

PR #323 will implement the use of bwa on TheiaMeta_Illumina_PE (aligned bams are input for the new binning task)

cimendes commented 7 months ago

tried running theiavalidate on two runs of Freyja_FASTQ but got stuck due to unrelated issues with this PR.

A manual size comparison of the outputted bam files confirmed that the two are of the same size: image image

I would like to get theiavalidate results to properly check file content but given that it's not cooperating, and this is an issue that is high priority, I'm not going to hold my review on it.

I'll wait for more feedback on the style-guide related issues I mentioned above before hitting approve but functionally this addition is working as expected for all tested workflows. Nicely done @kapsakcj and @jrotieno!

kapsakcj commented 7 months ago

FYI @jrotieno @cimendes I've set this PR to draft until we hear feedback from our partner. Let's not merge until they test and provide feedback.

If we don't get any feedback in the next week or 2, then I say we press forward with merging if we're satisfied with our own testing.

It would be good to incorporate this into the next PHB release