nedialkova-lab / mim-tRNAseq

Modification-induced misincorporation tRNA sequencing
GNU General Public License v3.0

memory problems fixed by sorting before coverage check #25

Closed jabard89 closed 2 years ago

jabard89 commented 2 years ago

This is a wonderful tool, thank you so much for building it! The idea of using clustering was very clever. I was getting an intermittent error during the "Determining un-deconvoluted clusters due to insufficient coverage at mismatches" phase. I identified the problem as an out-of-memory issue in the splitClusters code when computing coverage for large BAM files. The bedtools coverage documentation (https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html) recommends sorting large BED files before computing coverage:

If you are trying to compute coverage for very large files and are having trouble with excessive memory usage, please presort your data by chromosome and then by start position (e.g., sort -k1,1 -k2,2n in.bed > in.sorted.bed for BED files) and then use the -sorted option. This invokes a memory-efficient algorithm designed for large files.

To implement this, I extract the chromosomes from the BAM file, save them as a temporary two-column chromosome file, and then sort the BED file. This seems to have fixed my memory problems without breaking anything else. Best, Jared Bard
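The steps described above can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: the function names are hypothetical, the header text is assumed to come from something like `samtools view -H`, the second column of the chromosome file is assumed to hold the sequence lengths, and the assignment of the BED file to `-a` and the BAM file to `-b` is an assumption. It also assumes the BAM file is coordinate-sorted, since `-sorted` expects both inputs in the same chromosome order.

```python
from typing import List, Tuple


def chrom_sizes_from_header(header_text: str) -> List[Tuple[str, int]]:
    """Collect (name, length) pairs from the @SQ lines of a SAM/BAM header,
    preserving the order in which the references appear."""
    sizes = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each tag looks like "SN:name" or "LN:123"; split on the first colon only.
        tags = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        sizes.append((tags["SN"], int(tags["LN"])))
    return sizes


def sort_bed_lines(bed_lines: List[str], chrom_order: List[str]) -> List[str]:
    """Sort BED lines by chromosome order (as in the BAM header), then by start."""
    rank = {name: i for i, name in enumerate(chrom_order)}

    def key(line: str) -> Tuple[int, int]:
        chrom, start = line.split("\t")[:2]
        # Chromosomes missing from the header are placed last.
        return (rank.get(chrom, len(rank)), int(start))

    return sorted((l for l in bed_lines if l.strip()), key=key)


def coverage_command(bed_path: str, bam_path: str, genome_path: str) -> List[str]:
    """Build (but do not run) a bedtools coverage call using the sorted algorithm."""
    return ["bedtools", "coverage", "-a", bed_path, "-b", bam_path,
            "-sorted", "-g", genome_path]
```

The chromosome file itself would then be written as one tab-separated `name<TAB>length` line per entry returned by `chrom_sizes_from_header`, and the command list could be passed to `subprocess.run`.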

drewjbeh commented 2 years ago

Hi @jabard89! Thanks so much for the useful fix; this seems like a nice patch. I will merge as soon as I'm back in the office in a week. Also keep an eye out for the new version (v0.4) and the documentation updates that will come along with it in the coming weeks. There are several major upgrades, especially to the deconvolution.