output unaligned FASTQ files TheiaCov_Illumina PE and SE

kapsakcj commented 8 months ago

Closes #242

This dev branch should NOT be deleted after merging. We need to notify Alex E. and recommend switching to the main branch for further usage before deleting it.

:hammer_and_wrench: Changes Being Made

changes to `tasks/alignment/task_bwa.wdl`

- expose memory and docker as optional inputs
- added functionality to cope with uncompressed FASTQ files used as input
- added lots of echo statements throughout to help with debugging and troubleshooting (we should keep these, they will be useful in the future)
- IMPORTANT CHANGES
  - lines 41-48: altered the main bwa command to produce a sorted SAM file as output. Also pass in cpus to speed things up
  - lines 51-56: added samtools view command to produce a BAM that ONLY includes aligned reads
  - lines 59-64: added samtools view command to produce a BAM that ONLY includes unaligned reads
  - lines 74-79: altered samtools fastq command to produce paired FASTQ files for aligned reads
  - lines 82-87: added samtools fastq command to produce paired FASTQ files for unaligned reads
  - lines 89-94: altered samtools fastq command to produce FASTQ file for aligned reads (single end)
  - lines 95-101: added samtools fastq command to produce FASTQ file for unaligned reads (single end)
  - lines 105-106: added commands for indexing aligned BAM and unaligned BAM
- re-enabled memory retry feature (it was commented out, likely by accident)
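
For reviewers less familiar with how an aligned/unaligned split works, here is a minimal Python sketch of the idea. It is not the code in `task_bwa.wdl`: the function name `partition_by_mapping` and the sample records are hypothetical, and the only thing it relies on is the SAM flag bit 0x4 (segment unmapped). In samtools terms this is the distinction between `samtools view -F 4` (exclude unmapped) and `samtools view -f 4` (require unmapped); the exact options used in the task may differ.

```python
UNMAPPED = 0x4  # SAM flag bit meaning "segment unmapped"


def partition_by_mapping(sam_lines):
    """Split SAM alignment lines into (aligned, unaligned) using flag bit 0x4."""
    aligned, unaligned = [], []
    for line in sam_lines:
        if line.startswith("@"):  # header lines carry no flag field
            continue
        flag = int(line.rstrip("\n").split("\t")[1])
        (unaligned if flag & UNMAPPED else aligned).append(line)
    return aligned, unaligned


# Hypothetical records: one mapped pair (flags 99/147), one unmapped pair (77/141).
records = [
    "@SQ\tSN:ref\tLN:100",
    "r1\t99\tref\t1\t60\t4M\t=\t10\t13\tACGT\tIIII",
    "r1\t147\tref\t10\t60\t4M\t=\t1\t-13\tACGT\tIIII",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
aligned, unaligned = partition_by_mapping(records)
print(len(aligned), len(unaligned))
```

Every non-header record lands in exactly one of the two lists, which is why the aligned and unaligned BAMs together should account for all input reads.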

Changes to `workflows/theiacov/wf_theiacov_illumina_pe.wdl` and `workflows/theiacov/wf_theiacov_illumina_se.wdl` and `workflows/utilities/wf_ivar_consensus.wdl`

added 4 (PE) or 3 (SE) new optional outputs to both theiacov_illumina workflows:

File? read1_unaligned = ivar_consensus.read1_unaligned
File? read2_unaligned = ivar_consensus.read2_unaligned
File? sorted_bam_unaligned = ivar_consensus.sorted_bam_unaligned
File? sorted_bam_unaligned_bai = ivar_consensus.sorted_bam_unaligned_bai

Impacted Workflows/Tasks

TheiaCov_Illumina_PE_PHB (iVar subworkflow, NOT Flu track)
TheiaCov_Illumina_SE_PHB (iVar subworkflow, NOT Flu track)
Freyja_FASTQ (bwa task changed, but no workflow changes) - It would be good to compare the results of the same sample analyzed via v1.3.0 vs this dev branch

:brain: Context and Rationale

A user requested that the TheiaCov_Illumina workflows output unaligned FASTQ files, i.e. the reads that do not align to the target reference genome (example: Mpox). The goal is to run downstream analyses on these reads, such as Kraken for taxonomic assignment or attempting to assemble a genome from them.

In the process of addressing this request, I also added the unaligned_bam and its index. This is a BAM file that only contains reads that did not align to the reference (see lines 59-64 in the BWA WDL task for where this BAM is created). Not sure how this would be used, but it doesn't hurt to output it as well.

:clipboard: Workflow/Task Steps

Inputs

added 2 new optional inputs to bwa task: memory and docker

Outputs

added 4 (PE) or 3 (SE) new optional outputs to both theiacov_illumina workflows (served up via the ivar subworkflow):

File? read1_unaligned = ivar_consensus.read1_unaligned
File? read2_unaligned = ivar_consensus.read2_unaligned
File? sorted_bam_unaligned = ivar_consensus.sorted_bam_unaligned
File? sorted_bam_unaligned_bai = ivar_consensus.sorted_bam_unaligned_bai

Impacted Outputs

Pre-existing outputs that may be impacted (this excludes the new ones)

sorted_bam which is used heavily downstream in the workflow
read1_aligned & read2_aligned. Both existed previously, but command syntax has changed slightly so good to double check these.

:test_tube: Testing

Data

Test data used were as follows

1. Simulated SARS-CoV-2 data. This was expected to map 100% to the reference SARS-CoV-2 genome, i.e. 0 unaligned reads.
2. Human data. This was expected not to align, but some reads aligned to the reference SARS-CoV-2 genome.
3. A spiked dataset from 1. and 2. above. SARS-CoV-2 and some human reads expected to map.
4. Unaligned reads from 2. This was to test cases when nothing aligns to a reference, in which case the unaligned reads file will be created but empty.
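
To check expectations like "0 unaligned reads" or "file created but empty" for these datasets, a small read counter is enough. The sketch below is a hypothetical helper (`count_fastq_reads` and the file names are not part of the PR). It assumes 4-line FASTQ records and detects gzip by its magic bytes, so it accepts compressed, uncompressed, and empty files.

```python
import gzip
import os
import tempfile


def count_fastq_reads(path):
    """Count reads in a FASTQ file that may be gzipped, plain text, or empty."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"  # gzip magic bytes
    opener = gzip.open if is_gzip else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4  # assumes 4-line records (no wrapped sequences)


# Demonstration on throwaway files; names are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "unaligned_R1.fastq")
    zipped = os.path.join(tmp, "unaligned_R1.fastq.gz")
    empty = os.path.join(tmp, "empty.fastq.gz")
    record = "@read1\nACGT\n+\nIIII\n"
    with open(plain, "w") as fh:
        fh.write(record * 2)
    with gzip.open(zipped, "wt") as fh:
        fh.write(record * 3)
    with gzip.open(empty, "wt") as fh:
        pass  # valid gzip file with no reads
    counts = [count_fastq_reads(p) for p in (plain, zipped, empty)]
print(counts)
```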

Locally

Worked as expected

Terra

https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/Global_tree_testing/job_history/43118990-5338-4d92-b429-19158452cf78

Scenarios for Reviewer to Test

:microscope: Quality checks

Pull Request (PR) checklist:

[X] Include a description of what is in this pull request in this message.
[X] The workflow/task has been tested locally and on Terra
[X] The CI/CD has been adjusted and tests are passing
[X] Everything follows the style guide

kapsakcj commented 7 months ago

NOTE: this change will impact the Freyja_FASTQ workflow, as it uses the same bwa task

So, we are planning to skip adding these new outputs to Freyja_FASTQ workflow, but it would be good to do a ~~functional~~ validation test of the workflow to ensure it still runs as expected

I am worried that this may change the results/outputs of Freyja_FASTQ too, so it would be good to actually compare results instead of just a functional workflow test

kapsakcj commented 7 months ago

I launched a series of tests on 2 SARS-CoV-2 Illumina samples, hoping to compare outputs between the two versions (v1.3.0 vs this dev branch).

and single end:

~~TODO:~~

review results, compare these between dev branch and v1.3.0 runs:
- file sizes of aligned BAMs (should be identical or extremely close)

The BAMs are identical in size, as expected

- file sizes of aligned FASTQs (should be identical or extremely close)

The aligned FASTQs (like the R1 file) for v1.3.0 are slightly larger than those on the dev branch. I believe this is because, on the dev branch, the samtools fastq command for outputting aligned reads also requires that reads have a mate/pair, whereas v1.3.0 did not. So this is to be expected, but it is a minor difference from v1.3.0 behavior.
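
A toy illustration of why a pairing requirement can shrink the aligned FASTQs: the sketch below counts "aligned" reads with and without also requiring a mapped mate. It assumes the requirement behaves like excluding SAM flag bit 0x8 (mate unmapped). That is an assumption made for illustration only; the actual filter in the task may be expressed differently. The function name and flag values are hypothetical.

```python
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # this read's mate is unmapped


def count_aligned(flags, require_mapped_mate):
    """Count reads that are mapped, optionally also requiring a mapped mate."""
    excluded = UNMAPPED | (MATE_UNMAPPED if require_mapped_mate else 0)
    return sum(1 for flag in flags if not flag & excluded)


# Hypothetical flags: a fully mapped pair (99/147), a pair where only
# read 1 mapped (73/133), and a fully unmapped pair (77/141).
flags = [99, 147, 73, 133, 77, 141]
without_mate_check = count_aligned(flags, require_mapped_mate=False)
with_mate_check = count_aligned(flags, require_mapped_mate=True)
print(without_mate_check, with_mate_check)
```

The read with flag 73 is counted only when the mate check is off, which mirrors the small size difference described above.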

- general assembly metrics (number N, assembly_length_unambiguous, VADR alerts)

All of these QC metrics were identical, meaning that the sorted BAMs used for variant calling/consensus FASTA generation are identical to those from v1.3.0 and downstream analysis is not impacted by these changes. Woo! 🎉

- other things I'm forgetting?

cimendes commented 7 months ago

Workflows using bwa:

TheiaCoV_Illumina_SE and PE through the wf_ivar_consensus.wdl sub-workflow
- Tested by @kapsakcj
Freyja_FASTQ

PR #323 will implement the use of bwa on TheiaMeta_Illumina_PE (aligned bams are input for the new binning task)

cimendes commented 7 months ago

Tried running theiavalidate on two runs of Freyja_FASTQ but got stuck due to issues unrelated to this PR.

A manual size comparison of the output BAM files confirmed that the two are the same size.

I would like to get theiavalidate results to properly check file content, but given that it's not cooperating and this is a high-priority issue, I'm not going to hold my review on it.
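
As a stopgap between a size check and a full theiavalidate run, a checksum comparison distinguishes "same size" from "same bytes". The sketch below is a hypothetical helper, not part of theiavalidate. Note that BAM headers may record command lines or paths, so two functionally equivalent BAMs can still differ byte-for-byte; a mismatch here is a prompt to look closer, not proof of a regression.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_size_and_content(path_a, path_b):
    """Return (same_size, same_content) for two files."""
    same_size = os.path.getsize(path_a) == os.path.getsize(path_b)
    same_content = same_size and file_digest(path_a) == file_digest(path_b)
    return same_size, same_content


# Demonstration: two throwaway files with equal size but different bytes.
with tempfile.TemporaryDirectory() as tmp:
    run_a = os.path.join(tmp, "run_a.bin")
    run_b = os.path.join(tmp, "run_b.bin")
    with open(run_a, "wb") as fh:
        fh.write(b"AAAA")
    with open(run_b, "wb") as fh:
        fh.write(b"AAAT")
    result_diff = same_size_and_content(run_a, run_b)
    result_self = same_size_and_content(run_a, run_a)
print(result_diff, result_self)
```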

I'll wait for more feedback on the style-guide related issues I mentioned above before hitting approve but functionally this addition is working as expected for all tested workflows. Nicely done @kapsakcj and @jrotieno!

kapsakcj commented 7 months ago

FYI @jrotieno @cimendes I've set this PR to draft until we hear feedback from our partner. Let's not merge until they test and provide feedback.

If we don't get any feedback in the next week or two, then I say we press forward with merging if we're satisfied with our own testing.

It would be good to incorporate this into the next PHB release

theiagen / public_health_bioinformatics

output unaligned FASTQ files TheiaCov_Illumina PE and SE #275
