Closed lbergelson closed 5 years ago
@samuelklee Simple changes to reduce the in memory size of interval lists. @nh13 This might benefit you as well since it sounds like you're dealing with very large interval lists as well.
Merging #1309 into master will increase coverage by
0.996%
. The diff coverage is85.714%
.
@@ Coverage Diff @@
## master #1309 +/- ##
===============================================
+ Coverage 67.728% 68.724% +0.996%
- Complexity 8225 8855 +630
===============================================
Files 560 562 +2
Lines 33493 35503 +2010
Branches 5637 6165 +528
===============================================
+ Hits 22684 24399 +1715
- Misses 8632 8835 +203
- Partials 2177 2269 +92
Impacted Files | Coverage Δ | Complexity Δ | |
---|---|---|---|
...ain/java/htsjdk/samtools/util/OverlapDetector.java | 96.104% <100%> (ø) |
27 <3> (+2) |
:arrow_up: |
...c/main/java/htsjdk/samtools/util/IntervalList.java | 74.754% <100%> (+0.335%) |
75 <0> (+1) |
:arrow_up: |
...c/main/java/htsjdk/samtools/util/IntervalTree.java | 75.052% <66.667%> (+0.476%) |
45 <2> (+3) |
:arrow_up: |
...tools/cram/encoding/readfeatures/Substitution.java | 83.051% <0%> (-9.054%) |
19% <0%> (-3%) |
|
...a/htsjdk/samtools/cram/build/ContainerFactory.java | 95.238% <0%> (-1.732%) |
9% <0%> (+3%) |
|
...va/htsjdk/samtools/cram/build/ContainerParser.java | 95.122% <0%> (-1.174%) |
19% <0%> (+1%) |
|
...samtools/cram/structure/CramCompressionRecord.java | 92.593% <0%> (-0.376%) |
161% <0%> (+48%) |
|
...oding/reader/MultiRefSliceAlignmentSpanReader.java | 100% <0%> (ø) |
9% <0%> (+4%) |
:arrow_up: |
...ava/htsjdk/samtools/cram/ref/ReferenceContext.java | 91.667% <0%> (ø) |
22% <0%> (?) |
|
...htsjdk/samtools/cram/ref/ReferenceContextType.java | 100% <0%> (ø) |
1% <0%> (?) |
|
... and 31 more |
Thanks for making these changes! I think the per-contig OverlapDetector you suggested was enough to bring down memory in CollectReadCounts, so I didn't see any change there. However, I do see a substantial improvement in ModelSegments, which still uses global OverlapDetectors. I was able to run with ~10M bins and <8GB memory, while master required ~10GB. Runtime also actually went down slightly from 18 minutes to 17 minutes, but that might be within the noise.
@yfarjoun Could you take a look. The implementation in interval tree is slightly more complicated because it has to deal with arbitrary sentinel values instead of just null, and I added some tests because I realized that the conditions I cared about didn't look well covered.
@yfarjoun It's an improvement. Thanks for the review.
@yfarjoun Thanks for the reviews!
Description
Two separate changes which should reduce the in-memory size of IntervalList and OverlapDetector.
1: Cache the last seen contig name when loading IntervalLists and reuse the same String to avoid keeping many duplicate copies of "chr1" in memory. This should have very minor overhead since interval files are sorted it should be about as effective as a more complicated caching scheme.
Checklist