luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

Slow temporary .bcf header writing #95

Closed. gmagoon closed this issue 4 years ago

gmagoon commented 4 years ago

I was running trio calling with hs38DH.fa and the --threads option on a recent develop branch head (3882f8f) ... I noticed that the initial step of writing the headers for the temporary .bcf files for each contig was taking a relatively long time: a bit over a minute per contig, even for the small ones, which adds up with thousands of contigs.

I traced the bottleneck to get_call_types: https://github.com/luntergroup/octopus/blob/be006425202073729d249e1a93349f596685d3b4/src/core/octopus.cpp#L132

It appears that the temporary caller initialization used to determine the CallTypeSet may involve a fair amount of overhead. For example, in some sampled stack traces (octopus.issue95.backtraces.txt), I noticed some operations involving ReadSetProfile DepthStats. I'm guessing a lot of this is not strictly needed for getting the CallTypeSet? If so, it seems like there may be an opportunity for streamlining the performance. I suppose another route might be to just get the call types once (rather than per contig)?
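To illustrate the "get the call types once" idea, here is a minimal sketch. It assumes the call type set does not depend on the contig, which I have not verified. All names in it (`CallerFactory`, `build_and_query_call_types`, `TempHeader`, and the two `write_headers_*` functions) are hypothetical stand-ins, not octopus APIs. The counter only exists to make the number of expensive initializations visible.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these names are taken from octopus.
enum class CallType { variant, reference, denovo };
using CallTypeSet = std::set<CallType>;

struct CallerFactory {
    std::size_t builds = 0;  // how many times the expensive path has run

    // Stand-in for building a temporary caller just to ask for its call types.
    CallTypeSet build_and_query_call_types() {
        ++builds;
        return {CallType::variant, CallType::denovo};
    }
};

// Stand-in for the header of one temporary per-contig file.
struct TempHeader {
    std::string contig;
    CallTypeSet call_types;
};

// Pattern described above: one expensive initialization per contig.
std::vector<TempHeader> write_headers_per_contig(
    CallerFactory& factory, const std::vector<std::string>& contigs) {
    std::vector<TempHeader> result;
    result.reserve(contigs.size());
    for (const auto& contig : contigs) {
        result.push_back({contig, factory.build_and_query_call_types()});
    }
    return result;
}

// Alternative: query the call types once and reuse them for every contig.
// Only valid under the assumption that the set is contig-independent.
std::vector<TempHeader> write_headers_hoisted(
    CallerFactory& factory, const std::vector<std::string>& contigs) {
    std::vector<TempHeader> result;
    if (contigs.empty()) return result;  // nothing to write, skip the build
    const CallTypeSet call_types = factory.build_and_query_call_types();
    result.reserve(contigs.size());
    for (const auto& contig : contigs) {
        result.push_back({contig, call_types});
    }
    assert(result.size() == contigs.size());
    return result;
}
```

With N contigs the first version performs N initializations and the second performs one, while both produce the same headers. If the call types can in fact differ between contigs, caching them keyed on whatever they depend on would be the safer variant.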

dancooke commented 4 years ago

Thanks Greg. I've just pushed a commit (4b1bf402db585c49fba0bb4c28ad94782b5bdd6d) to develop that I think should address this problem.

gmagoon commented 4 years ago

Works great, thanks Dan! It sped things up by orders of magnitude.