snakemake-workflows / dna-seq-varlociraptor

A Snakemake workflow for calling small and structural variants under any kind of scenario (tumor/normal, tumor/normal/relapse, germline, pedigree, populations) via the unified statistical model of Varlociraptor.
MIT License

fix: switch mem_gb to mem_mb #262

Closed: dlaehnemann closed this PR 10 months ago

dlaehnemann commented 10 months ago

A mem_gb specification plus a default-resources: specification of mem_mb for a cluster system leads to multiple distinct resource definitions that can get confused, so we should just stick to the standard mem_mb here.
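For context, the change boils down to something like this in the affected rule (a minimal sketch; the rule name, paths, and the 64000 value are illustrative, not taken from the workflow):

```
rule annotate_umis:
    input:
        bam="results/{sample}.bam",
        umi_fq="data/{sample}.umi.fastq.gz",
    output:
        "results/{sample}.umi_annotated.bam",
    resources:
        # was: mem_gb=64 -- a non-standard resource name that coexists with
        # a profile's default-resources: mem_mb instead of replacing it
        mem_mb=64000,
    shell:
        "fgbio AnnotateBamWithUmis -i {input.bam} -f {input.umi_fq} -o {output}"
```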

dlaehnemann commented 10 months ago

Good point, done.

dlaehnemann commented 10 months ago

Yes, the specification should definitely be dynamic. I was already planning this but hadn't gotten around to looking into a useful calculation. Here's my reasoning for what I now suggest:

The documentation of fgbio AnnotateBamWithUmis states that this tool reads the entire input UMI FASTQ files into memory in an uncompressed format. As we work with gzipped FASTQ files, I would expect this to take about 4x the size of the input fastq.gz files, according to Table 2 of this paper: Marius Nicolae and others, LFQC: a lossless compression algorithm for FASTQ files, Bioinformatics, Volume 31, Issue 20, October 2015, Pages 3276–3281, https://doi.org/10.1093/bioinformatics/btv384

As we should plan for some extra headroom, but also have the BAM file as another input, I think that 4*input.size_mb should be a good estimate.
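In a rule this could be expressed as a dynamic resource, roughly like so (a sketch; input.size_mb is Snakemake's combined size of all input files, and memory resources should be integers):

```
resources:
    mem_mb=lambda wc, input: int(4 * input.size_mb)
```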

This can be rather heavy on the memory requirements, but that should be fine on modern servers and cluster systems, and I think this workflow will usually be run on bigger compute infrastructure. So I think this is acceptable, but as an alternative we could sort the fastq.gz files beforehand and then use the fgbio AnnotateBamWithUmis flag --sorted.
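That alternative might look roughly like this (a sketch only; it assumes the inputs are sorted by read name as the --sorted flag requires, and the exact invocation should be checked against the fgbio docs):

```
# pre-sort the UMI fastq by read name, then annotate without reading
# the whole file into memory
fgbio SortFastq -i umi.fastq.gz -o umi.sorted.fastq.gz
fgbio AnnotateBamWithUmis -i name_sorted.bam -f umi.sorted.fastq.gz -o annotated.bam --sorted
```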

dlaehnemann commented 10 months ago

The other option would be to switch to something that does the UMI annotation earlier. I was already wondering why this only happens after mapping, when tools like fgbio FastqToBam should be able to do the UMI extraction directly from FASTQ files. As the author of AnnotateBamWithUmis puts it, this tool should be the last resort for UMI annotation. Or are there important reasons that I missed for doing the UMI annotation so late?
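For illustration, that extraction might look like this (the read structures are made-up placeholders; they depend on the actual UMI length and layout of the library):

```
fgbio FastqToBam \
    --input sample_R1.fastq.gz sample_R2.fastq.gz \
    --read-structures 8M+T +T \
    --sample sampleA --library libA \
    --output sample.unmapped.bam
```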

FelixMoelder commented 10 months ago

Should we really go for 4*input.size_mb? In my specific case that would be (21+18)*4 = 156 GB. Depending on the hardware this could easily exceed the available memory and impact system stability. For my scenario I tried some memory settings and 80 GB were sufficient (probably a bit less, as I first tried 40 GB and then 80 GB). So something like 4*fastq + bam should be sufficient. But I do not see that there is an option for getting the size of separate input files. Can we alternatively set some upper limit? E.g. lambda wc, input: min(4*input.size_mb, 0.9*max_memory)? Not sure if Snakemake has any information about the total system memory.
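A sketch of what that could look like, assuming named rule inputs (umi_fq, bam) and using psutil as one way to query total system memory; neither of these is in the workflow yet:

```
import os
import psutil

# rough ceiling at 90% of total system memory
MAX_MEM_MB = int(0.9 * psutil.virtual_memory().total / 1024**2)

resources:
    mem_mb=lambda wc, input: min(
        # the FASTQ decompresses to ~4x, the BAM only adds its own size
        4 * int(os.path.getsize(input.umi_fq) / 1024**2)
        + int(os.path.getsize(input.bam) / 1024**2),
        MAX_MEM_MB,
    )
```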

FelixMoelder commented 10 months ago

> The other option would be to switch to something that does the UMI annotation earlier. I was already wondering why this only happens after mapping, when tools like fgbio FastqToBam should be able to do the UMI extraction directly from FASTQ files. As the author of AnnotateBamWithUmis puts it, this tool should be the last resort for UMI annotation. Or are there important reasons that I missed for doing the UMI annotation so late?

I think transforming the FASTQ into a BAM file makes things more complicated. In general, reads are mapped after adapter trimming, while the UMIs are derived from the input FASTQs, because this information is lost after trimming. If we want to annotate at an early stage, we would need to run FastqToBam first and then transform the annotated BAM files into FASTQs again in order to run Cutadapt. Maybe I missed something here, but fastq -> bam -> fastq -> trimming -> mapping does not seem reasonable to me.

dlaehnemann commented 10 months ago

But cutadapt doesn't have any UMI functionality and should only remove sequencing adapters, which usually remain at the end of a read if the read is longer than the insert size / fragment length. The UMIs, which are usually at the start of a read, should remain after cutadapt. So we should be able to run:

  1. cutadapt
  2. fgbio FastqToBam
  3. bwa mem

Step 3 does seem to imply using samtools fastq to bring the BAM file back to FASTQ format, but its output can be piped directly on to bwa mem. A detailed description is here:

https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md#step-12-ubam---mapped-bam

This would also avoid having to merge the untrimmed reads again and read all of them into memory for fgbio AnnotateBamWithUmis.
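Concretely, step 1.2 of the linked doc amounts to roughly this pipeline (a sketch; exact flags should be checked against the doc, and the thread count and paths are placeholders):

```
samtools fastq unmapped.bam \
    | bwa mem -p -t 16 ref.fasta - \
    | fgbio ZipperBams --unmapped unmapped.bam --ref ref.fasta --output mapped.bam
```

Here samtools fastq emits interleaved reads, bwa mem -p consumes them from stdin, and fgbio ZipperBams copies the UMI tags from the unmapped BAM back onto the mapped records.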

FelixMoelder commented 10 months ago

This sounds like a good option to me. We should probably implement it in a separate PR and go with the memory-consuming option for now, to get the workflow running again.

dlaehnemann commented 10 months ago

> Should we really go for 4*input.size_mb? In my specific case that would be (21+18)*4 = 156 GB. Depending on the hardware this could easily exceed the available memory and impact system stability. For my scenario I tried some memory settings and 80 GB were sufficient (probably a bit less, as I first tried 40 GB and then 80 GB). So something like 4*fastq + bam should be sufficient. But I do not see that there is an option for getting the size of separate input files. Can we alternatively set some upper limit? E.g. lambda wc, input: min(4*input.size_mb, 0.9*max_memory)? Not sure if Snakemake has any information about the total system memory.

OK, for such a case the 4* factor does seem a bit excessive. We probably expect the size of the BAM file and (all of the) FASTQ file(s) to be similar, right? So going for 2* or maybe 2.5* of input.size_mb might be a bit better for the quick fix?
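That would just change the factor in the resource callable, e.g. (sketch):

```
resources:
    mem_mb=lambda wc, input: int(2.5 * input.size_mb)
```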

FelixMoelder commented 10 months ago

I would be happy with that, especially as this would only be a temporary solution until we replace the UMI annotation as discussed.