Good point, done.
Yes, the specification should definitely be dynamic. I was already planning this but hadn't gotten around to looking into a useful calculation. Here's my reasoning for what I now suggest:
The documentation of `fgbio AnnotateBamWithUmis` states that this tool reads the entire input UMI FASTQ files into memory in uncompressed form. As we work with gzipped FASTQ files, I would expect this to take about 4x the size of the input `fastq.gz` files, according to Table 2 of this paper:
Marius Nicolae and others, LFQC: a lossless compression algorithm for FASTQ files, Bioinformatics, Volume 31, Issue 20, October 2015, Pages 3276–3281, https://doi.org/10.1093/bioinformatics/btv384
As we should plan for some extra headroom, but also have the BAM file as another input, I think that `4 * input.size_mb` should be a good estimate.
This can be rather heavy on memory, but it should be fine on modern servers and cluster systems -- and I think this workflow will usually be run on bigger compute infrastructure. So I think this is acceptable, but as an alternative we could sort the `fastq.gz` files beforehand and then use the `fgbio AnnotateBamWithUmis` flag `--sorted`.
The other option would be to switch to something that does the UMI annotation earlier. I was already wondering why this only happens after mapping, when tools like `fgbio FastqToBam` should be able to do the UMI extraction directly from FASTQ files. As the author of `AnnotateBamWithUmis` puts it, this tool should be the last resort for UMI annotation. Or are there important reasons that I missed for doing the UMI annotation so late?
Should we really go for `4 * input.size_mb`? In my specific case that would be `(21 + 18) * 4 = 156` GB. Depending on the hardware this could easily exceed the available memory and impact system stability. In my scenario I tried a few memory settings and 80 GB was sufficient (probably a bit less would also work; I only tried 40 GB first and then 80 GB).
So something like `4 * fastq + bam` should be sufficient. But I do not see an option for getting the sizes of separate input files. Can we alternatively set some upper limit? E.g. `lambda wc, input: min(4 * input.size_mb, 0.9 * max_memory)`? Not sure if Snakemake has any information about the total system memory.
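To make this concrete, here is a minimal sketch (not from the PR) of a dynamic, capped memory specification in Snakemake. The rule and file names are illustrative, and `psutil` is an assumed extra dependency for querying total system memory; Snakemake exposes the summed `input.size_mb`, while per-file sizes can be taken via `os.path.getsize`:

```python
import os
import psutil

def annotate_mem_mb(wildcards, input):
    """4x the (gzipped) UMI FASTQ size plus the BAM size, capped at 90%
    of the machine's physical memory."""
    fastq_mb = sum(os.path.getsize(f) for f in input.umi_fastqs) / 1024**2
    bam_mb = os.path.getsize(input.bam) / 1024**2
    cap_mb = 0.9 * psutil.virtual_memory().total / 1024**2
    return int(min(4 * fastq_mb + bam_mb, cap_mb))

rule annotate_bam_with_umis:
    input:
        bam="mapped/{sample}.bam",
        umi_fastqs=["reads/{sample}_umi.fastq.gz"],
    output:
        "annotated/{sample}.bam",
    resources:
        mem_mb=annotate_mem_mb,
    shell:
        # Pass the computed limit on to the JVM so fgbio can actually use it.
        "fgbio -Xmx{resources.mem_mb}m AnnotateBamWithUmis "
        "-i {input.bam} -f {input.umi_fastqs} -o {output}"
```

Note that `psutil` only sees the memory of the machine evaluating the Snakefile, which on a cluster is usually the submit host, so the cap is a rough guard rather than a per-node guarantee.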
> The other option would be to switch to something that does the UMI annotation earlier. I was already wondering why this only happens after mapping, when tools like `fgbio FastqToBam` should be able to do the UMI extraction directly from FASTQ files. As the author of `AnnotateBamWithUmis` puts it, this tool should be the last resort for UMI annotation. Or are there important reasons that I missed for doing the UMI annotation so late?
I think transforming the FASTQ into a BAM file makes things more complicated. In general, reads are mapped after adapter trimming, while the UMIs are derived from the input FASTQs, as this information is lost after trimming. If we want to annotate at an early stage, we would need to run `FastqToBam` first and then transform the annotated BAM files back into FASTQs in order to run `cutadapt`. Maybe I missed something here, but `fastq -> bam -> fastq -> trimming -> mapping` does not appear reasonable to me.
But `cutadapt` doesn't have any UMI functionality and should only remove sequencing adapters, which usually remain at the end of a read if the read is longer than the insert size / fragment length. The UMIs, which are usually at the start of a read, should remain after `cutadapt`. So we should be able to run:
1. `cutadapt`
2. `fgbio FastqToBam`
3. `bwa mem`
The latter does seem to imply `samtools fastq` to bring the BAM file back to FASTQ format, but this can then be piped directly on to `bwa mem`. A detailed description is here:
This would also avoid having to merge the untrimmed reads in again and to read all of them into memory for `fgbio AnnotateBamWithUmis`; see the sketch below.
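A minimal sketch of that order as Snakemake rules, under stated assumptions: paired-end reads, an 8 bp UMI at the start of R1 (read structure `8M+T`), and illustrative paths, adapter sequences, and rule names that are not from the actual workflow. The `samtools fastq -T RX` / `bwa mem -C` combination (emit the `RX` tag in the FASTQ comment, then append that comment to the SAM records) is what lets the UMI annotation survive the round trip through FASTQ:

```python
rule cutadapt:
    input:
        r1="reads/{sample}_R1.fastq.gz",
        r2="reads/{sample}_R2.fastq.gz",
    output:
        r1="trimmed/{sample}_R1.fastq.gz",
        r2="trimmed/{sample}_R2.fastq.gz",
    shell:
        # Illustrative TruSeq adapter sequences; 5' UMIs are untouched.
        "cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC "
        "-o {output.r1} -p {output.r2} {input.r1} {input.r2}"

rule fastq_to_bam:
    input:
        r1="trimmed/{sample}_R1.fastq.gz",
        r2="trimmed/{sample}_R2.fastq.gz",
    output:
        "unmapped/{sample}.bam",
    shell:
        # 8M+T: first 8 bases of R1 are the UMI (stored in the RX tag),
        # the rest is template; +T: all of R2 is template. Sample/library
        # values here are placeholders.
        "fgbio FastqToBam --input {input.r1} {input.r2} "
        "--read-structures 8M+T +T --sample {wildcards.sample} "
        "--library {wildcards.sample} --output {output}"

rule map:
    input:
        "unmapped/{sample}.bam",
    output:
        "mapped/{sample}.bam",
    threads: 8
    shell:
        # -T RX: copy the RX tag into the FASTQ comment; -p: interleaved
        # pairs from stdin; -C: append the FASTQ comment (i.e. the RX tag)
        # to the SAM records, so the UMI survives mapping.
        "samtools fastq -T RX {input} | "
        "bwa mem -C -p -t {threads} ref/genome.fa - | "
        "samtools sort -o {output} -"
```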
This sounds like a good option to me. We should probably implement this in a separate PR and go with the memory-consuming option for now, to get the workflow running again.
> Should we really go for `4 * input.size_mb`? In my specific case that would be `(21 + 18) * 4 = 156` GB. Depending on the hardware this could easily exceed the available memory and impact system stability. In my scenario I tried a few memory settings and 80 GB was sufficient (probably a bit less would also work; I only tried 40 GB first and then 80 GB). So something like `4 * fastq + bam` should be sufficient. But I do not see an option for getting the sizes of separate input files. Can we alternatively set some upper limit? E.g. `lambda wc, input: min(4 * input.size_mb, 0.9 * max_memory)`? Not sure if Snakemake has any information about the total system memory.
OK, for such a case the `4*` does seem a bit excessive. We probably expect the size of the BAM file and (all of the) FASTQ file(s) to be similar, right? So going for `2*` or maybe `2.5*` might be a bit better for the quick fix? (With BAM ≈ FASTQ in size, `4 * fastq + bam` works out to roughly `2.5 * input.size_mb`.)
I would be happy with that, especially as this would only be a temporary solution until we replace the UMI annotation as discussed.
A `mem_gb` specification plus a `default-resources:` specification of `mem_mb` for a cluster system leads to multiple distinct resource definitions that can get confused -- so we should just stick to the standard `mem_mb` here.
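As a sketch of the agreed quick fix: only the `2.5*` factor and the Snakemake-standard `mem_mb` key are taken from the discussion above; the rule and file names are illustrative.

```python
rule annotate_bam_with_umis:
    input:
        bam="mapped/{sample}.bam",
        umi_fastqs=["reads/{sample}_umi.fastq.gz"],
    output:
        "annotated/{sample}.bam",
    resources:
        # 2.5x the summed input size (BAM + gzipped FASTQs), using the
        # standard mem_mb key so it cooperates with a profile's
        # default-resources instead of coexisting with it as mem_gb.
        mem_mb=lambda wildcards, input: int(2.5 * input.size_mb),
    shell:
        "fgbio -Xmx{resources.mem_mb}m AnnotateBamWithUmis "
        "-i {input.bam} -f {input.umi_fastqs} -o {output}"
```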