Question: iCount xlsites expected runtime

mirax87 commented 5 years ago

Hi,

To process our D. melanogaster iCLIP library, I used Snakemake to chain the iCount steps together and added benchmarking, specifically for iCount xlsites with quantification based on cDNA and reads.

I am seeing runtimes of roughly 1 to 4 days for iCount xlsites on our cluster system. The number of reads per multiplexing barcode varies considerably, and runtime correlates with it.

In terms of parameters, I use

--group_by start
--mapq_th 3

using the output GTF from iCount segment.

I wonder what, besides the total number of mapped reads, determines the runtime of iCount xlsites, and whether there are useful strategies for pre-filtering the BAM files to speed up the process without losing (too much) sensitivity.

Cheers

JureZmrzlikar commented 5 years ago

Hi @mirax87 !

Are you using the --segmentation input? If so, this is the main reason that iCount xlsites is taking so long. Please run it without segmentation (AFAIK, this is how most users run it). We should speed up the algorithm for the case where segmentation is given, but we never found the time to do it properly.
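To compare the two setups in a benchmark, the command could be assembled with and without the segmentation input. The following is only a sketch: the helper build_xlsites_cmd, the file names, and the positional order of the input and output files are assumptions, so check them against iCount xlsites -h before use. Only the option names come from this thread.

```python
import shlex


def build_xlsites_cmd(bam, sites_single, sites_multi, skipped,
                      group_by="start", mapq_th=3, segmentation=None):
    """Assemble an argument list for iCount xlsites.

    Hypothetical helper: the positional order of the four files is an
    assumption and should be verified against `iCount xlsites -h`.
    """
    cmd = [
        "iCount", "xlsites",
        bam, sites_single, sites_multi, skipped,
        "--group_by", group_by,
        "--mapq_th", str(mapq_th),
    ]
    # Only pass --segmentation when a GTF is given; leaving it out is
    # the faster setup recommended above.
    if segmentation is not None:
        cmd += ["--segmentation", segmentation]
    return cmd


# Placeholder file names, for illustration only.
fast = build_xlsites_cmd("sample.bam", "single.bed", "multi.bed", "skipped.bam")
slow = build_xlsites_cmd("sample.bam", "single.bed", "multi.bed", "skipped.bam",
                         segmentation="segment.gtf")

print(shlex.join(fast))
print(shlex.join(slow))
```

Either list can be passed to subprocess.run, or the printed string can be pasted into a Snakemake shell directive, so that the benchmark differs only in the presence of --segmentation.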

Regarding other factors that could affect runtime:

group_by should have zero effect on runtime.
A higher mapq_th will take fewer (poorly mapped) reads into account, so this should speed things up a bit. But if the mapping quality is sufficient, the effect should not be very significant.
If you have really high coverage (>10k, 100k), lowering the max_barcodes parameter can speed things up significantly, but this should be used only in such cases.
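To estimate how much a given mapq_th would reduce the input before committing to a long run, one could count reads on either side of the threshold. This is a minimal sketch: count_reads_by_mapq and the toy records are hypothetical, and it assumes that reads with MAPQ greater than or equal to the threshold are kept. It relies only on the SAM convention that MAPQ is the fifth tab-separated field.

```python
def count_reads_by_mapq(sam_lines, mapq_th):
    """Return (kept, dropped) counts for SAM records at a MAPQ threshold.

    Assumes a read is kept when MAPQ >= mapq_th. Header lines (starting
    with '@') and empty lines are skipped.
    """
    kept = dropped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])  # MAPQ is the fifth column in SAM
        if mapq >= mapq_th:
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# Toy SAM records, made up for illustration only.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr2L\t100\t50\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr2L\t200\t2\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr2L\t300\t3\t4M\t*\t0\t0\tACGT\tIIII",
]

kept, dropped = count_reads_by_mapq(toy_sam, mapq_th=3)
print(f"kept={kept} dropped={dropped}")
```

On real data the lines would come from a BAM file converted to text, for example by streaming the output of samtools view. If only a small fraction of reads falls below the threshold, raising mapq_th is unlikely to change runtime much, which matches the point above.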

mirax87 commented 5 years ago

Hi @JureZmrzlikar,

you are right, I am using iCount xlsites --segmentation. I'll try without.

Thanks for the quick feedback. Cheers

tomazc / iCount

Question: iCount xlsites expected runtime #191