nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License

Managing CRAM compression level #1182

Open dr-yoon opened 1 year ago

dr-yoon commented 1 year ago

Description of feature

Hi. I appreciate your effort in developing this pipeline. I noticed that the CRAM files generated by Sarek are much larger than those from another pipeline. For example, an identical WGS sample produces a CRAM file of approximately 30 GB with Sarek compared to 10 GB with the other pipeline. It would be nice if we could adjust the compression level of the resulting CRAM files for more efficient storage. Thank you :)

FriederikeHanssen commented 1 year ago

Hey! Good idea: do you mean for the output only? It seems there is a bit of a trade-off between compression level and runtime: https://medium.com/@acarroll.dna/looking-at-trade-offs-in-compression-levels-for-genomics-tools-eec2834e8b94

If you run the pipeline with default settings, MarkDuplicates followed by samtools view takes care of the conversion.

Do you know which view flag it is? I don't see it in the docs: http://www.htslib.org/doc/samtools-view.html 😱

dr-yoon commented 1 year ago

Wow, what a super-fast reply! 😱

I think we can tweak the option described here: http://www.htslib.org/doc/samtools.html

level=INT Output only. Specifies the compression level from 1 to 9, or 0 for uncompressed. If the output format is SAM, this also enables BGZF compression, otherwise SAM defaults to uncompressed.
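To make the option above concrete, here is a minimal sketch of how `level` could be passed to a BAM-to-CRAM conversion through samtools' global `--output-fmt-option` flag. The helper name `cram_view_command` and the file names are hypothetical, and how Sarek would wire this into its own samtools step is not covered in this thread, so treat it as an illustration rather than the pipeline's actual command.

```python
import shlex


def cram_view_command(bam_path, cram_path, reference_fasta, level=9, threads=1):
    """Build a `samtools view` argument list that writes CRAM at a given level.

    Hypothetical helper for illustration only; it is not part of Sarek.
    """
    # The samtools docs quoted above define level as 1 to 9, or 0 for uncompressed.
    if not 0 <= level <= 9:
        raise ValueError("level must be between 0 and 9")
    return [
        "samtools", "view",
        "-C",                                     # write CRAM output
        "-T", reference_fasta,                    # reference FASTA used for CRAM encoding
        "--output-fmt-option", f"level={level}",  # the option discussed above
        "-@", str(threads),                       # additional compression threads
        "-o", cram_path,
        bam_path,
    ]


# Placeholder file names, not real pipeline outputs.
cmd = cram_view_command("sample.md.bam", "sample.md.cram", "genome.fasta", level=9)
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where samtools is installed. Higher levels would be expected to trade extra runtime for smaller files, as the article linked above discusses.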

FriederikeHanssen commented 1 year ago

😄 exactly the moment I checked my emails.

Ah great, I was only checking the subcommand's own docs. Sure, that should be no problem to add :)