snakemake-workflows / rna-seq-kallisto-sleuth

A Snakemake workflow for differential expression analysis of RNA-seq data with Kallisto and Sleuth.
MIT License

fix: canonical transcript mapped read extraction #77

Closed · dlaehnemann closed this pull request 7 months ago

dlaehnemann commented 1 year ago

The main aim of this PR is to speed up the extraction of reads mapped to the canonical transcripts in the rule get_mapped_canonical_transcripts: it previously needed about 44GB of memory regardless of the input BAM file size and took hours of grepping, and now finishes within seconds to minutes with almost no memory footprint. The relevant change is here: https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/compare/fix-canonical-transcript-mapped-read-extraction?expand=1#diff-6562a38fb77f8839a8731b8a882bf0d0b683d6268cdffb433dcbe1f360ccedc4R103
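
For illustration, here is a minimal sketch of what such a BED-based extraction can look like as a Snakemake rule. It is not the exact rule from this PR; all file paths are placeholders. `samtools view -L` restricts output to reads overlapping the regions listed in a BED file and streams the input BAM, which is why it runs with an almost constant, small memory footprint:

```
# hypothetical sketch, not the exact rule from this PR
rule get_mapped_canonical_transcripts:
    input:
        bam="results/mapped/{sample}.bam",
        bed="resources/canonical_transcripts.bed",
    output:
        bam="results/canonical/{sample}.canonical.bam",
    log:
        "logs/canonical/{sample}.log",
    shell:
        # -L keeps only reads overlapping regions in the BED file,
        # -b writes BAM output; the command streams record by record,
        # so memory usage stays minimal regardless of BAM size
        "samtools view -b -L {input.bed} -o {output.bam} {input.bam} 2> {log}"
```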

This required a BED file, so I switched the rule get_canonical_ids to generating a valid BED file and now use that BED file to keep track of transcript strand information, instead of hacking it into the contig names of the reference FASTA file. This led to some cleanup in the workflow.
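
For reference, the strand then lives in the standard sixth column of a BED6 record instead of in a modified contig name. A made-up example line (transcript ID and coordinates are purely illustrative), with the transcript itself acting as the "chromosome" since the reference is a transcriptome FASTA:

```
ENST00000367770	0	2580	ENST00000367770	0	-
```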

Other things that happened along the way are:

For now, this is not yet tested, so I'll mark it as a draft to start with. But I wanted it up here already, so that it can be tested on different setups by checking out the branch.

dlaehnemann commented 1 year ago

Just to document what still needs to be done here:

Currently, the rule get_canonical_transcripts skips most transcripts: the poly-A tails have been removed, so the lengths given in the BED input file no longer match the actual lengths of the FASTA entries. We also have to find a way to avoid having the start and end coordinates of a transcript appended to the FASTA entry names, because this would otherwise break downstream transcript name matching.
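
As a rough sketch of one way to handle the second point (an assumption, not necessarily what this workflow will end up doing): if the extraction appends the coordinates in the `name::chrom:start-end` form that `bedtools getfasta -name` produces, the suffix can simply be stripped from the header lines before anything downstream sees them:

```
# hypothetical sketch: strip appended coordinates from FASTA headers
rule get_canonical_transcripts:
    input:
        fasta="resources/transcriptome.clean.fa",
        bed="resources/canonical_transcripts.bed",
    output:
        fasta="resources/canonical_transcripts.fa",
    log:
        "logs/get_canonical_transcripts.log",
    shell:
        # extract the canonical entries, then drop everything from '::'
        # onward in header lines, so downstream transcript name matching
        # sees plain transcript IDs again
        "(bedtools getfasta -name -fi {input.fasta} -bed {input.bed} | "
        " sed '/^>/ s/::.*//' > {output.fasta}) 2> {log}"
```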

johanneskoester commented 11 months ago

LGTM, but there are conflicts with the master branch that need to be fixed before merging.

dlaehnemann commented 11 months ago

Thanks for looking through this. I will merge and create a new release as soon as the tests pass with the conflicts resolved...