samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Optimize AbstractBAMFileIndex#query to avoid quadratic behavior when … #1396

Closed tomwhite closed 5 years ago

tomwhite commented 5 years ago

…reading sequences in order.

Description

Reading index sequences in the order they are defined in the header is a very common access pattern. However, the way it is implemented in AbstractBAMFileIndex means that it has quadratic complexity, since when reading sequence i in skipToSequence that is not in the cache, it will start from the beginning, even if sequence i - 1 is in the cache.

The effect is particularly stark when there are many reference sequences, such as hg38 which has over 3000. See https://github.com/disq-bio/disq/pull/109 for an example.

The fix is simple - look to see if sequence i - 1 is in the cache, and start from that position if it is. The change is covered by existing tests.

/cc @lbergelson

codecov-io commented 5 years ago

Codecov Report

Merging #1396 into master will increase coverage by 0.019%. The diff coverage is 100%.

@@              Coverage Diff               @@
##              master    #1396       +/-   ##
==============================================
+ Coverage     68.031%   68.05%   +0.019%     
- Complexity      8368     8372        +4     
==============================================
  Files            573      573               
  Lines          33886    33890        +4     
  Branches        5662     5663        +1     
==============================================
+ Hits           23053    23062        +9     
+ Misses          8643     8640        -3     
+ Partials        2190     2188        -2
Impacted Files Coverage Δ Complexity Δ
src/main/java/htsjdk/samtools/CSIIndex.java 71.622% <ø> (ø) 59 <0> (ø) :arrow_down:
...ain/java/htsjdk/samtools/AbstractBAMFileIndex.java 85.119% <100%> (+0.363%) 50 <0> (+2) :arrow_up:
.../htsjdk/variant/variantcontext/VariantContext.java 77.714% <0%> (ø) 246% <0%> (ø) :arrow_down:
...samtools/util/AsyncBlockCompressedInputStream.java 76% <0%> (+4%) 13% <0%> (+1%) :arrow_up:
...htsjdk/samtools/util/nio/DeleteOnExitPathHook.java 89.474% <0%> (+10.526%) 4% <0%> (+1%) :arrow_up:
tomwhite commented 5 years ago

I've fixed this now. I wasn't sure how to add a direct test. Any suggestions, or is the existing coverage sufficient?

lbergelson commented 5 years ago

@tomwhite I wrote a test here: https://github.com/samtools/htsjdk/tree/skip-to-sequence-optimization

lbergelson commented 5 years ago

Closing since I merged #1397 which includes this.