feat: reduce fgbio memory usage

FelixMoelder commented 2 months ago

For large samples fgbio AnnotateBamWithUmis fails due to memory issues. This happens as fgbio loads the fully uncompressed UMI fastq file into memory. We already tried to fix this in a previous PR (https://github.com/snakemake-workflows/dna-seq-varlociraptor/pull/262) by dynamically assigning more memory based on input file size. Still this does not suffice in many cases.

To get this fixed fgbio suggests a best practice where raw reads are transformed into unmapped bam files while annotating UMIs at the same time. In our case this is not feasible as reads need to be adapter trimmed by cutadapt first. But in case of Qiagen reads the UMI lies in front of the adapter sequence (source) and therefore can not be annotated while transforming the reads into ubam.

Alternatively, fgbio AnnotateBamWithUmis allows to input a queryname sorted bam and fastq file reading the UMIs sequentially from the fastq file, first. fastq files can be sorted by fgbio SortFastq. I intended to directly sort the mapped reads by queryname when running map_reads but it showed that queryname sorting differs between fgbio and samtools receiving incompatible files. So, it is necessary to reorder the mapped reads by fgbio SortBam before annotating UMIs.

PS: Not sure if we treat this as fix or feature

dlaehnemann commented 2 months ago

Can we not do the standard best practices and then do the adapter trimming once we have gone back to FASTQ files? So basically do the ubam loop before running cutadapt?

dlaehnemann commented 2 months ago

And otherwise, I didn't fully understand the queryname sorting problem. Does fgbio AnnotateBamWithUmis not allow for using samtools sort -n files? What's the exact issue there?

And would samtools collate be an alternative, as this would be quicker:

https://www.htslib.org/doc/samtools-collate.html

FelixMoelder commented 2 months ago

And otherwise, I didn't fully understand the queryname sorting problem. Does fgbio AnnotateBamWithUmis not allow for using samtools sort -n files? What's the exact issue there?

And would samtools collate be an alternative, as this would be quicker:

https://www.htslib.org/doc/samtools-collate.html

I tried samtools sort -n first. But it turned out that the lexicographic order of samtools is not the same as the order of fgbio SortFastq and therefore annotation fails.

Also samtools collate will not help as the sorting between fastqs and bams must be identical.

FelixMoelder commented 2 months ago

Can we not do the standard best practices and then do the adapter trimming once we have gone back to FASTQ files? So basically do the ubam loop before running cutadapt?

We might could do that but I do not see why it should be a best practice to transform fastq -> ubam -> fastq -> trimming. Also we would need to write the annotated UMIs from the unmapped bam into the description fields of the fastqs again. And after mapping the umis need to be written to the bam record over again. All of this appears to me like a worst practice.

dlaehnemann commented 2 months ago

Oh, yeah. I totally agree that there doesn't seem to be an "elegant" solution here. But from looking at everything, I think the least redundant work to do, is if we simply do the best practices and only insert the cutadapt adapter trimming in step 1.2 (after samtools fastq, before bwa mem): https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md#step-12-ubam---mapped-bam

We might have to make sure that cutadapt does not fully discard reads, even if their sequence is completely trimmed. And I think samtools fastq in this setup produces interleaved fastq, so we have to make sure cutadapt works properly in this scenario.

BTW, many thanks for taking care of this -- I know it's a lot of hassle!

FelixMoelder commented 2 months ago

I just spend a thought on how the "best practice" actually works. What they do is to annotated the fastqs in the first step and output a ubam file. This bam-file is then transformed to fastq once again and mapped against the reference. The actual annotation is then performed by fgbio ZipperBams by inputing the annotated ubam and mapped bam file. So, I do not see why the transformation of the ubam into a fastq is even necessary as one can just use the initial fastqs, map these and then perform annotation. In our case this would just be a separate step. I just did a quick drawing of the workflow (sorry for my handwriting; boxes are rules).

In that case we would just skip the ubam -> fastq step and still could trim the initial fastqs.

But, is this any better than the current implementation? The best practice adds two steps:

annotating umis (by creating ubams)
annotating umis to mapped reads (using the annotated ubam)

Our implementation also adds two steps:

query sorting initial fastqs
annotating bam by initial fastq

If I should guess I would assume that the best practice is a bit faster (annotation ubam vs query sorting) and probably also required less disk space (bam vs untrimmed fastqs). Disk space can probably be ignored as the output is just temporary.

Edit: I just realized that ZipperBams also requires a queryname sorting:

Both the unmapped and mapped BAMs must be a) queryname sorted or grouped (i.e. all records with the same name are grouped together in the file), and b) have the same ordering of querynames. If either of these are violated the output is undefined!

This means we have the same issue as for the current implementation and will end up with additional steps for the best practice which makes it a bad practice. ;)

FelixMoelder commented 2 months ago

I just sketched both options to get an impression what the workflow would look like if we consider that both options would require queryname sorted files.

Best Practice: Best_Practice

Alternative:

The only differences are that the best practice requires an additional step (transformation into ubam) but therefore we could use samtools for queryname sorting as we have only bam files for annotation. The alternative does not require any file type transformation but as queryname sorting is not supported for fastq files by samtools we would need to do queryname sorting by fgbio for fastqs and during mapping.

dlaehnemann commented 2 months ago

Many thanks for all the thoughts and sketches. I think I see mainly two good options, and I mention a third and why I don't think it is a good idea... :sweat_smile:

A) query sort fastq, integrate with AnnotateBamWithUmis

I think this option is actually closest to our current implementation, so would be minimally invasive with regard to the current workflow status. It's basically your last sketch, and I think the only necessary additions will be:

The option --sorted to AnnotateBamWithUmis.
An added fgbio SortFastq rule/step to get a query name sorted fastq.
An added fgbio SortBam rule/step to get a query name sorted bam.
An added samtools sort rule/step to get a coordinate sorted bam at the end (we might pipe fgbio AnnotateBamWithUmis output directly into samtools sort to avoid one round of writing to disc).

With this, we initially get two pathways starting from the raw fastq files:

fastq -> cutadapt -> bwa mem -> fgbio SortBam -> query sorted bam
fastq -> fgbio SortFastq -> query sorted fastq

Then, we integrate the query sorted fastq and bam with fgbio AnnotateBamWithUmis -> samtools sort -> coordinate sorted bam.

B) cutadapt as-is, followed by best practice

Could you double-check, that the UMIs get trimmed out by cutadapt in the Qiagen setup you quoted above? To me it really looks like the UMIs come before the Nextera Transposase 2 adapter, which I think should be the adapter to trim in this setup. So the UMIs should remain in the trimmed reads and then the regular best practices should work:

cutadapt -> trimmed fastqs
fgbio FastqToBam -> unmapped bam
samtools fastq | bwa mem -p | fgbio ZipperBams -> mapped bam
fgbio GroupReadsByUmi -> mapped bam with UMI read group annotations (do we need this for deduplication?)

C) cutadapt injected into best practices

I think this is a non-option, see reasons at the end. But I do want to document thinking this through... :sweat_smile:

If adapter trimming with cutadapt does indeed remove the UMIs in the Qiagen setting, we could instead inject it into the pipe in step 3. above. For this, we would need the --interleaved flag of cutadapt:

fgbio FastqToBam -> unmapped bam
samtools fastq | cutadapt --interleaved | bwa mem -p | fgbio ZipperBams -> mapped bam
fgbio GroupReadsByUmi -> mapped bam with UMI read group annotations (do we need this for deduplication?)

However, this has several downsides:

Standard fastqc cannot be run on the trimmed fastq, as the trimmed fastq does not get saved. If we do opt for saving it, we generate even more IO.
The workflow wiring gets more complicated, because cutadapt gets run at different points in the workflow, depending on whether we need to annotate UMIs or not.

dlaehnemann commented 2 months ago

Looking at the changes, A) is pretty much what you ended up implementing. So let's stick with that solution for now, I'll review the current state.

FelixMoelder commented 2 months ago

I would also advocate for solution A. Queryname sorting the bam via fgbio SortBam is not necessary as a distinct rule as I already updated the bwa mem snakemake-wrapper to support fgbio for sorting in addition to samtools and picard. The only thing I probably should do is to rename the parameter function for fgbio AnnotateBamWithUmis and to replace the short sorting parameter by the long one for comprehensibility.

snakemake-workflows / dna-seq-varlociraptor