theiagen / public_health_viral_genomics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of viral pathogens of concern, especially SARS-CoV-2
https://public-health-viral-genomics-theiagen.readthedocs.io/
GNU Affero General Public License v3.0

kraken2 ENHANCE! (redirect kraken2 output to /dev/null; remove --classified-out options; give more threads) #28

Closed: kapsakcj closed this issue 3 years ago

kapsakcj commented 3 years ago

I think kraken2 tasks could be sped up greatly with some tweaks to the command: https://github.com/theiagen/public_health_viral_genomics/blob/5a3d1f7510f4c57b8602049469b6d0329ca5c430/tasks/task_taxonID.wdl#L21

First would be to remove the --classified-out option. I don't believe classified reads are used downstream, correct? If they are, ignore this suggestion.

Second would be to redirect STDOUT to /dev/null. The node/VM running the task doesn't need to print kraken2's verbose STDOUT or save it to the log, so discarding it saves A LOT of runtime and a good bit of disk space.

I recently saw a kraken2 log that was ~66MB. That's too big 🥴

Without >/dev/null:

$ time kraken2 --classified-out 3000112549.cseqs.fq --threads 4 --db /kraken2-db 3000112549_S11_L001_R1_001.fastq.gz \
--report 3000112549_kraken2_report.txt

<tons of STDOUT removed>

6486078 sequences (236.88 Mbp) processed in 139.402s (2791.7 Kseq/m, 101.96 Mbp/m).
  6074768 sequences classified (93.66%)
  411310 sequences unclassified (6.34%)

real    2m27.289s
user    0m49.805s
sys     0m41.093s

With >/dev/null and with --classified-out, roughly an order of magnitude faster:

$ time kraken2 --classified-out 3000112549.cseqs.fq --threads 4 --db /kraken2-db 3000112549_S11_L001_R1_001.fastq.gz \
--report 3000112549_kraken2_report.txt >/dev/null
Loading database information... done.
6486078 sequences (236.88 Mbp) processed in 10.971s (35471.4 Kseq/m, 1295.47 Mbp/m).
  6074768 sequences classified (93.66%)
  411310 sequences unclassified (6.34%)

real    0m18.354s
user    0m36.047s
sys     0m6.618s
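
To put a number on the two runs above, here is a minimal sketch that converts the `time` output to seconds and takes the ratio. The helper names `to_seconds` and `speedup` are made up for illustration; the timings are the ones posted above.

```shell
# Convert a `time`-style duration such as "2m27.289s" into seconds.
to_seconds() {
  echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f", $1 * 60 + $2 }'
}

# Ratio of two durations (slow / fast), rounded to one decimal place.
speedup() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.1f", a / b }'
}

# Wall-clock ("real") times: without vs. with >/dev/null
speedup 2m27.289s 0m18.354s   # prints 8.0
echo
```

So wall-clock time drops by about 8x, while kraken2's own reported processing time (139.402s vs 10.971s) drops by about 12.7x.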

With >/dev/null and without --classified-out:

$ time kraken2 --threads 4 --db /kraken2-db 3000112549_S11_L001_R1_001.fastq.gz \
 --report 3000112549_kraken2_report.txt >/dev/null
Loading database information... done.
6486078 sequences (236.88 Mbp) processed in 12.147s (32038.7 Kseq/m, 1170.10 Mbp/m).
  6074768 sequences classified (93.66%)
  411310 sequences unclassified (6.34%)

real    0m24.753s
user    0m37.853s
sys     0m16.056s

^Removing --classified-out actually seems to make things a little bit slower in this run, but there's still no sense in writing the file if we don't intend to use it.
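
Putting the first two suggestions together, a rough sketch of what the call could look like as a small wrapper. The function name `run_kraken2_quiet` and its argument order are hypothetical and not taken from the workflow; only the kraken2 flags come from the commands above.

```shell
# Hypothetical wrapper: classify reads, keep only the report,
# and throw away the per-read classification lines on STDOUT.
# $1 = database dir, $2 = reads (FASTQ), $3 = report path, $4 = threads (default 4)
run_kraken2_quiet() {
  kraken2 --threads "${4:-4}" --db "$1" --report "$3" "$2" >/dev/null
}

# Example call (same inputs as the runs above):
# run_kraken2_quiet /kraken2-db 3000112549_S11_L001_R1_001.fastq.gz 3000112549_kraken2_report.txt
```

The summary lines (sequences processed, classified, unclassified) still show up because, as the runs above show, they are not part of the redirected STDOUT. kraken2's usage text also lists `--output -` as a way to suppress the normal output; that variant was not benchmarked here.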

And thirdly, kraken2 does benefit from extra CPUs. I would throw 8 CPUs (max) at the task, and scale RAM accordingly if Terra/Cromwell doesn't do it for you:

$ time kraken2 --threads 8 --db /kraken2-db 3000112549_S11_L001_R1_001.fastq.gz \
--report 3000112549_kraken2_report.txt >/dev/null
Loading database information... done.
6486078 sequences (236.88 Mbp) processed in 11.352s (34281.1 Kseq/m, 1252.00 Mbp/m).
  6074768 sequences classified (93.66%)
  411310 sequences unclassified (6.34%)

real    0m15.075s
user    0m38.086s
sys     0m5.574s
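
If the thread count is set from inside the task rather than hard-coded, the "8 max" idea could be sketched like this. `pick_threads` is a hypothetical helper, and reading the CPU count with `getconf _NPROCESSORS_ONLN` is an assumption about the runtime environment.

```shell
# Hypothetical helper: use the available CPUs, but never more than the cap (default 8).
# $1 = available CPUs, $2 = cap
pick_threads() {
  if [ "$1" -lt "${2:-8}" ]; then
    echo "$1"
  else
    echo "${2:-8}"
  fi
}

# Assumes getconf reports the online CPU count on this machine.
threads="$(pick_threads "$(getconf _NPROCESSORS_ONLN)")"
echo "would run kraken2 with --threads $threads"
```

For comparison, the 4-thread and 8-thread runs above took 24.753s and 15.075s of wall-clock time respectively.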

kapsakcj commented 3 years ago

I can start a dev branch for this unless someone beats me to it

kevinlibuit commented 3 years ago

Fixed by the merge of #47.