samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

making changes to reduce size of giant interval lists #1309

Closed lbergelson closed 5 years ago

lbergelson commented 5 years ago

Description

Two separate changes which should reduce the in-memory size of IntervalList and OverlapDetector.

1: Cache the last seen contig name when loading IntervalLists and reuse the same String to avoid keeping many duplicate copies of "chr1" in memory. This should have very minor overhead since interval files are sorted it should be about as effective as a more complicated caching scheme.

  1. Use SingletonSets as the leaf objects in OverlapDetector when constructing it. These are substantially smaller than HashSets of the same size. Change to HashSet only when we find 2 or more identically positioned elements. This will have a minor performance cost while constructing the OverlapDetector but should result in dramatically decreased size in memory.

Checklist

lbergelson commented 5 years ago

@samuelklee Simple changes to reduce the in memory size of interval lists. @nh13 This might benefit you as well since it sounds like you're dealing with very large interval lists as well.

codecov-io commented 5 years ago

Codecov Report

Merging #1309 into master will increase coverage by 0.996%. The diff coverage is 85.714%.

@@               Coverage Diff               @@
##              master     #1309       +/-   ##
===============================================
+ Coverage     67.728%   68.724%   +0.996%     
- Complexity      8225      8855      +630     
===============================================
  Files            560       562        +2     
  Lines          33493     35503     +2010     
  Branches        5637      6165      +528     
===============================================
+ Hits           22684     24399     +1715     
- Misses          8632      8835      +203     
- Partials        2177      2269       +92
Impacted Files Coverage Δ Complexity Δ
...ain/java/htsjdk/samtools/util/OverlapDetector.java 96.104% <100%> (ø) 27 <3> (+2) :arrow_up:
...c/main/java/htsjdk/samtools/util/IntervalList.java 74.754% <100%> (+0.335%) 75 <0> (+1) :arrow_up:
...c/main/java/htsjdk/samtools/util/IntervalTree.java 75.052% <66.667%> (+0.476%) 45 <2> (+3) :arrow_up:
...tools/cram/encoding/readfeatures/Substitution.java 83.051% <0%> (-9.054%) 19% <0%> (-3%)
...a/htsjdk/samtools/cram/build/ContainerFactory.java 95.238% <0%> (-1.732%) 9% <0%> (+3%)
...va/htsjdk/samtools/cram/build/ContainerParser.java 95.122% <0%> (-1.174%) 19% <0%> (+1%)
...samtools/cram/structure/CramCompressionRecord.java 92.593% <0%> (-0.376%) 161% <0%> (+48%)
...oding/reader/MultiRefSliceAlignmentSpanReader.java 100% <0%> (ø) 9% <0%> (+4%) :arrow_up:
...ava/htsjdk/samtools/cram/ref/ReferenceContext.java 91.667% <0%> (ø) 22% <0%> (?)
...htsjdk/samtools/cram/ref/ReferenceContextType.java 100% <0%> (ø) 1% <0%> (?)
... and 31 more
samuelklee commented 5 years ago

Thanks for making these changes! I think the per-contig OverlapDetector you suggested was enough to bring down memory in CollectReadCounts, so I didn't see any change there. However, I do see a substantial improvement in ModelSegments, which still uses global OverlapDetectors. I was able to run with ~10M bins and <8GB memory, while master required ~10GB. Runtime also actually went down slightly from 18 minutes to 17 minutes, but that might be within the noise.

lbergelson commented 5 years ago

@yfarjoun Could you take a look. The implementation in interval tree is slightly more complicated because it has to deal with arbitrary sentinel values instead of just null, and I added some tests because I realized that the conditions I cared about didn't look well covered.

lbergelson commented 5 years ago

@yfarjoun It's an improvement. Thanks for the review.

lbergelson commented 5 years ago

@yfarjoun Thanks for the reviews!