sstadick / perbase

Per-base per-nucleotide depth analysis
MIT License

from_pileup_mate_aware very inefficient? #64

Open brentp opened 1 year ago

brentp commented 1 year ago

Hi Seth, if I understand correctly, from_pileup_mate_aware is run for each column in a pileup. This means it is grouping, sorting, grouping, etc. for each position, often over the same set of reads for each consecutive column.

Am I misreading? If not, then this will be an extreme bottleneck, as it's O(n^2) or worse. We can mitigate it by checking whether the end of the left-most read is less than the start of the right-most read of the pair; then at least the cost is limited to the fraction of reads that actually overlap their mate.
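
A minimal sketch of that non-overlap check, using a hypothetical simplified record (the struct and field names below are illustrative, not perbase's or rust-htslib's actual types):

```rust
/// Illustrative record; field names are hypothetical, not perbase's types.
struct Read {
    end: i64,        // exclusive reference end, computed from the CIGAR
    mate_start: i64, // reference start of the mate
}

/// For the left-most read of a pair: if it ends at or before its mate starts,
/// the pair cannot overlap, so mate-aware handling can be skipped entirely.
fn pair_may_overlap(read: &Read) -> bool {
    read.end > read.mate_start
}

fn main() {
    assert!(!pair_may_overlap(&Read { end: 200, mate_start: 300 })); // fast path
    assert!(pair_may_overlap(&Read { end: 250, mate_start: 220 }));  // needs mate-aware handling
}
```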

brentp commented 1 year ago

mosdepth uses such a check (the read ends strictly before its mate starts), and if it fails, it saves the read into a hashtable. I think that can work here, but I'm not sure whether I'm missing something about the implementation.
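
A rough sketch of the hashtable idea as described in this thread (not mosdepth's or perbase's actual code; the record type, field names, and helper functions are all hypothetical):

```rust
use std::collections::HashMap;

/// Illustrative read record; field names are hypothetical, not perbase's types.
struct Read {
    name: String,
    start: i64,      // 0-based reference start
    end: i64,        // exclusive reference end (from the CIGAR)
    mate_start: i64, // reference start of the mate
}

/// Count depth for a single read over [start, end); placeholder body.
fn add_single(read: &Read) {
    println!("{}: +1 over [{}, {})", read.name, read.start, read.end);
}

/// Count depth for an overlapping pair, crediting the overlapped bases once.
fn add_pair(left: &Read, right: &Read) {
    let overlap_start = right.start.max(left.start);
    let overlap_end = left.end.min(right.end);
    println!(
        "{}: +1 over [{}, {}) and [{}, {}), overlap [{}, {}) counted once",
        left.name, left.start, left.end, right.start, right.end, overlap_start, overlap_end
    );
}

/// Reads arrive in coordinate order. Non-overlapping pairs take the fast path;
/// only reads that may overlap their mate are cached, keyed by name, so the
/// overlap is resolved once per pair instead of once per pileup column.
fn process(reads: Vec<Read>) {
    let mut pending: HashMap<String, Read> = HashMap::new();
    for read in reads {
        if read.start <= read.mate_start {
            // Left-most mate: this read alone tells us whether the pair can overlap.
            // (Ties on start would need an extra tie-break in real code.)
            if read.end > read.mate_start {
                pending.insert(read.name.clone(), read); // overlap possible; wait for the mate
            } else {
                add_single(&read); // fast path: mates cannot overlap
            }
        } else {
            // Right-most mate: if the pair overlaps, the left mate is in the cache.
            match pending.remove(&read.name) {
                Some(left) => add_pair(&left, &read),
                None => add_single(&read),
            }
        }
    }
    // Cached reads whose mate never arrived (unmapped, filtered, off-region).
    for (_, read) in pending {
        add_single(&read);
    }
}

fn main() {
    let reads = vec![
        Read { name: "a".into(), start: 100, end: 200, mate_start: 300 }, // non-overlapping pair, left
        Read { name: "b".into(), start: 150, end: 260, mate_start: 240 }, // overlapping pair, left
        Read { name: "b".into(), start: 240, end: 350, mate_start: 150 }, // overlapping pair, right
        Read { name: "a".into(), start: 300, end: 400, mate_start: 100 }, // non-overlapping pair, right
    ];
    process(reads);
}
```

The point of the design is that the hashtable only ever holds reads whose pair actually overlaps, so for typical libraries its size stays small and the per-column regrouping cost goes away.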

sstadick commented 1 year ago

Nope, you aren't missing anything! That is how it does it, and it for sure slows things down. I do like the idea in your PR if the functions from htslib fall into place.