d-cameron closed this pull request 5 years ago
Also performed a minor refactor to remove the duplicated symbolic allele detection code in Allele and VCFCodec.
Merging #1264 into master will increase coverage by 0.001%. The diff coverage is 83.333%.
```diff
@@             Coverage Diff              @@
##              master     #1264      +/- ##
===========================================
+ Coverage     69.353%   69.354%   +0.001%
- Complexity      8301      8307        +6
===========================================
  Files            555       555
  Lines          33116     33127       +11
  Branches        5572      5575        +3
===========================================
+ Hits           22967     22975        +8
- Misses          7890      7894        +4
+ Partials        2259      2258        -1
```

Impacted Files | Coverage Δ | Complexity Δ |
---|---|---|
...main/java/htsjdk/variant/vcf/AbstractVCFCodec.java | 73.62% <75%> (+0.614%) :arrow_up: | 90 <1> (ø) |
...ain/java/htsjdk/variant/variantcontext/Allele.java | 77.551% <85.714%> (+0.345%) :arrow_up: | 95 <9> (+7) :arrow_up: |
...samtools/util/AsyncBlockCompressedInputStream.java | 72% <0%> (-4%) :arrow_down: | 12% <0%> (-1%) :arrow_down: |
Alternative approaches:
- Just remove the HashMap entirely and have a fixed-size cache of the small alleles you're likely to encounter. 680 Allele records covers all A/C/G/T ref and alt alleles up to 4bp in length; 2,728 covers up to 5bp. A fixed static cache works better when opening multiple files, since the existing cache is per-codec anyway.
- `String.intern()` the short strings.
- Replace the map with a `WeakHashMap<String, WeakReference<String>>` so it doesn't leak.
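The fixed-size cache option could be sketched roughly as follows. This is a hypothetical illustration, not htsjdk code: `SmallAlleleCache` pre-builds every A/C/G/T string up to a length cap and passes anything longer through untouched, so long SV alleles are never retained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the fixed-size cache idea above: pre-build every
// A/C/G/T allele string up to MAX_LEN bases. For MAX_LEN = 4 that is
// 4 + 16 + 64 + 256 = 340 sequences; counting each as both a ref and an
// alt Allele gives the 680 figure quoted in the thread.
public final class SmallAlleleCache {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};
    private static final int MAX_LEN = 4;
    private static final Map<String, String> CACHE = new HashMap<>();

    static {
        build("");
    }

    private static void build(String prefix) {
        if (prefix.length() >= MAX_LEN) return;
        for (char b : BASES) {
            String s = prefix + b;
            CACHE.put(s, s);  // canonical instance for this sequence
            build(s);
        }
    }

    // Return the canonical instance for small allele strings; pass through
    // anything else (e.g. multi-kilobase SV alleles) so it is never retained.
    public static String canonicalize(String allele) {
        String cached = CACHE.get(allele);
        return cached != null ? cached : allele;
    }

    public static int size() {
        return CACHE.size();
    }
}
```

Because the cache is static and bounded, it is shared across codecs and its memory footprint is fixed regardless of how many files are opened.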
@d-cameron We had a similar proposal a while ago https://github.com/samtools/htsjdk/pull/845 which fell apart because no one could agree about the right thing to do so in the end we did nothing.
However, looking at this more, it looks to me that the caching / interning of the allele strings is entirely pointless. The string just gets passed into a call to Allele.create(), which then re-encodes it into bytes. I can't tell exactly, because the string encoding is complex and varies by locale, but it doesn't look to me like there is any sort of string -> byte cache in the encoding stage. It looks like the right thing to do might be to not bother caching any of the allele strings at all. Then, in addition, we could create a cache of String -> Allele, which would reduce the byte[] overhead from many duplicate AA alleles...
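The String -> Allele cache suggested above might look something like this. This is a hedged sketch, not htsjdk's implementation: `SimpleAllele` is a minimal stand-in for htsjdk's Allele (which also tracks ref/alt status, symbolic alleles, etc.), and the length cutoff is an assumed parameter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal stand-in for an Allele holding its bases as a byte[]; the real
// htsjdk Allele is richer (ref/alt flag, symbolic allele handling, etc.).
final class SimpleAllele {
    final byte[] bases;
    SimpleAllele(String s) { this.bases = s.getBytes(StandardCharsets.US_ASCII); }
}

// Hypothetical String -> Allele cache: duplicate short allele strings
// (A, AA, AT, ...) resolve to one shared object (and one shared byte[]),
// while long SV alleles bypass the cache entirely so they stay collectible.
final class AlleleCache {
    private static final int MAX_CACHED_LENGTH = 10;  // assumed cutoff
    private final Map<String, SimpleAllele> cache = new ConcurrentHashMap<>();

    SimpleAllele create(String bases) {
        if (bases.length() > MAX_CACHED_LENGTH) {
            return new SimpleAllele(bases);  // never retained by the cache
        }
        return cache.computeIfAbsent(bases, SimpleAllele::new);
    }
}
```

Caching the Allele objects rather than the strings attacks the actual duplication (the per-record byte[] allocations) instead of an intermediate that Allele.create() immediately re-encodes anyway.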
@yfarjoun Can you take a look and see if you agree with my interpretation of what's happening? How did we not see that before when we looked at this?
I agree, @lbergelson. However, I would like to note that the internedString map is also used for filters, reference names, and genotypes. For those objects, the current solution might be OK.
Hi @d-cameron,
Are you interested in continuing to work on this branch or would you like us to take it over for you?
@lbergelson happy for you to take it over. The change is relatively small, but the code review will be much faster if you only have to do it internally and not bounce back and forth to Australia for each revision. I have a few other projects on the go, and since I already have this patch packaged into GRIDSS, it's less urgent than the other work.
Thanks
Closing this because it's been replaced by #1282
Description
VCFCodec caches previously seen allele strings, presumably so that the millions of identical allele strings (e.g. A, AA, AT, T) are interned in the string HashMap<>. Unfortunately, when parsing a large VCF with SV alleles, this causes a memory leak.
Since each allele string is retained until the VCFCodec is GCed, streaming a VCF containing SVs (e.g. just counting the number of records) results in a continual increase in memory usage even when the VCF records are no longer in use. For SVs, the allele string can easily be thousands of bytes in size.
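The leak pattern described above can be reduced to a few lines. This is an illustrative sketch, not the actual VCFCodec code: a per-codec map that interns every allele string it sees stays tiny for SNV data, but every distinct multi-kilobase SV allele becomes a key that lives as long as the codec does.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the leaky pattern described above (not the real VCFCodec code):
// an unbounded per-codec cache that interns every allele string it sees.
// Each distinct SV allele string is retained until the codec is GCed,
// so streaming a large SV VCF grows memory without bound.
final class LeakyStringCache {
    private final Map<String, String> seen = new HashMap<>();

    String intern(String allele) {
        return seen.computeIfAbsent(allele, s -> s);
    }

    int retainedEntries() {
        return seen.size();
    }
}
```

Streaming N records that each carry a unique long allele leaves N large strings pinned in the map, even after every record has been discarded, which is exactly the behavior observed when just counting records in an SV VCF.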
Changes:
Checklist