samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Fixed memory leak in VCFCodec #1264

Closed d-cameron closed 5 years ago

d-cameron commented 5 years ago

Description

VCFCodec caches previously seen allele strings which I presume is so that the millions of identical allele strings (e.g A, AA, AT, T) are weakly interned in the string HashMap<>. Unfortunately, when parsing large VCF with SV alleles, this causes a memory leak.

Since each allele string is retained until the VCFCodec is GCed, streaming a VCF containing SVs (e.g just counting # records) results in a continual increase in memory usage even when the VCF records are no longer used. For SVs, the allele string can easily be thousands of bytes in size.

Changes:

Checklist

d-cameron commented 5 years ago

Also performed a minor refactor to remove the duplicated symbolic alllele detection code in Allele and VCFCodec

codecov-io commented 5 years ago

Codecov Report

Merging #1264 into master will increase coverage by 0.001%. The diff coverage is 83.333%.

@@               Coverage Diff               @@
##              master     #1264       +/-   ##
===============================================
+ Coverage     69.353%   69.354%   +0.001%     
- Complexity      8301      8307        +6     
===============================================
  Files            555       555               
  Lines          33116     33127       +11     
  Branches        5572      5575        +3     
===============================================
+ Hits           22967     22975        +8     
- Misses          7890      7894        +4     
+ Partials        2259      2258        -1
Impacted Files Coverage Δ Complexity Δ
...main/java/htsjdk/variant/vcf/AbstractVCFCodec.java 73.62% <75%> (+0.614%) 90 <1> (ø) :arrow_down:
...ain/java/htsjdk/variant/variantcontext/Allele.java 77.551% <85.714%> (+0.345%) 95 <9> (+7) :arrow_up:
...samtools/util/AsyncBlockCompressedInputStream.java 72% <0%> (-4%) 12% <0%> (-1%)
d-cameron commented 5 years ago

Alternative approaches:

lbergelson commented 5 years ago

@d-cameron We had a similar proposal a while ago https://github.com/samtools/htsjdk/pull/845 which fell apart because no one could agree about the right thing to do so in the end we did nothing.

However, looking at this more it looks to me that the caching / interning of the allele strings is entirely pointless. It just gets passed into a call to Allele.create() which then re-encodes the string into bytes. I can't tell exactly, because the string encoding is complex and varies by local, but it doesn't look to me that it has any sort of string -> byte cache in the encoding stage. It looks like maybe the right thing to do would be to not bother caching any of the allele strings. Then in addition we could create a cache of String -> Allele's which would be used to reduce byte[] overhead from many duplicate AA alleles...

@yfarjoun Can you take a look and see if you agree with my interpretation of what's happening? How did we not see that before when we looked at this?

yfarjoun commented 5 years ago

I agree, @lbergelson however, I would like to note that the internedString map is also used for filters, reference names and genotypes. For those objects, the current solution might be OK.

lbergelson commented 5 years ago

Hi @d-cameron,

Are you interested in continuing to work on this branch or would you like us to take it over for you?

d-cameron commented 5 years ago

@lbergelson happy for you to take it over. The change is relatively small but the code reviewing will be much faster if you only have to do it internally and not bounce back and forward to Australia for each revision. I have a few other projects on the go and since I already have this patch packaged into GRIDSS, it's less urgent that the other work.

Thanks

lbergelson commented 5 years ago

Closing this because it's been replace by #1282