d-cameron closed this pull request 5 years ago
Also performed a minor refactor to remove the duplicated symbolic allele detection code in Allele and VCFCodec.
Merging #1264 into master will increase coverage by 0.001%. The diff coverage is 83.333%.
```diff
@@             Coverage Diff              @@
##              master     #1264      +/- ##
===========================================
+ Coverage     69.353%   69.354%   +0.001%
- Complexity      8301      8307        +6
===========================================
  Files            555       555
  Lines          33116     33127       +11
  Branches        5572      5575        +3
===========================================
+ Hits           22967     22975        +8
- Misses          7890      7894        +4
+ Partials        2259      2258        -1
```

Impacted Files | Coverage Δ | Complexity Δ |
---|---|---|
...main/java/htsjdk/variant/vcf/AbstractVCFCodec.java | 73.62% <75%> (+0.614%) :arrow_up: | 90 <1> (ø) |
...ain/java/htsjdk/variant/variantcontext/Allele.java | 77.551% <85.714%> (+0.345%) :arrow_up: | 95 <9> (+7) :arrow_up: |
...samtools/util/AsyncBlockCompressedInputStream.java | 72% <0%> (-4%) :arrow_down: | 12% <0%> (-1%) :arrow_down: |
Alternative approaches:
- Just remove the HashMap entirely and have a fixed-size cache of the small alleles you're likely to encounter. 680 Allele records covers all A/C/G/T ref and alt alleles up to 4bp in length; 2,728 covers up to 5bp. A fixed static cache works better when opening multiple files, since the existing cache is per-codec anyway.
- `String.intern()` the short strings.
- Replace the map with a `WeakHashMap<String, WeakReference<String>>` so it doesn't leak.
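The fixed-size cache option could be sketched roughly as follows. This is a hypothetical illustration, not htsjdk code: `SmallAlleleCache` pre-builds every A/C/G/T string up to a length cap and passes anything longer through untouched, so long SV alleles are never retained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the fixed-size cache idea above: pre-build every
// A/C/G/T allele string up to MAX_LEN bases. For MAX_LEN = 4 that is
// 4 + 16 + 64 + 256 = 340 sequences; counting each as both a ref and an
// alt Allele gives the 680 figure quoted in the thread.
public final class SmallAlleleCache {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};
    private static final int MAX_LEN = 4;
    private static final Map<String, String> CACHE = new HashMap<>();

    static {
        build("");
    }

    private static void build(String prefix) {
        if (prefix.length() >= MAX_LEN) return;
        for (char b : BASES) {
            String s = prefix + b;
            CACHE.put(s, s);  // canonical instance for this sequence
            build(s);
        }
    }

    // Return the canonical instance for small allele strings; pass through
    // anything else (e.g. multi-kilobase SV alleles) so it is never retained.
    public static String canonicalize(String allele) {
        String cached = CACHE.get(allele);
        return cached != null ? cached : allele;
    }

    public static int size() {
        return CACHE.size();
    }
}
```

Because the cache is static and bounded, it is shared across codecs and its memory footprint is fixed regardless of how many files are opened.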
@d-cameron We had a similar proposal a while ago https://github.com/samtools/htsjdk/pull/845 which fell apart because no one could agree about the right thing to do so in the end we did nothing.
However, looking at this more, it looks to me that the caching / interning of the allele strings is entirely pointless. The string just gets passed into a call to Allele.create(), which then re-encodes it into bytes. I can't tell exactly, because the string encoding is complex and varies by locale, but it doesn't look to me like there is any sort of string -> byte cache in the encoding stage. It looks like the right thing to do might be to not bother caching any of the allele strings at all. Then, in addition, we could create a cache of String -> Allele, which would reduce the byte[] overhead from many duplicate AA alleles...
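The String -> Allele cache suggested above might look something like this. This is a hedged sketch, not htsjdk's implementation: `SimpleAllele` is a minimal stand-in for htsjdk's Allele (which also tracks ref/alt status, symbolic alleles, etc.), and the length cutoff is an assumed parameter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal stand-in for an Allele holding its bases as a byte[]; the real
// htsjdk Allele is richer (ref/alt flag, symbolic allele handling, etc.).
final class SimpleAllele {
    final byte[] bases;
    SimpleAllele(String s) { this.bases = s.getBytes(StandardCharsets.US_ASCII); }
}

// Hypothetical String -> Allele cache: duplicate short allele strings
// (A, AA, AT, ...) resolve to one shared object (and one shared byte[]),
// while long SV alleles bypass the cache entirely so they stay collectible.
final class AlleleCache {
    private static final int MAX_CACHED_LENGTH = 10;  // assumed cutoff
    private final Map<String, SimpleAllele> cache = new ConcurrentHashMap<>();

    SimpleAllele create(String bases) {
        if (bases.length() > MAX_CACHED_LENGTH) {
            return new SimpleAllele(bases);  // never retained by the cache
        }
        return cache.computeIfAbsent(bases, SimpleAllele::new);
    }
}
```

Caching the Allele objects rather than the strings attacks the actual duplication (the per-record byte[] allocations) instead of an intermediate that Allele.create() immediately re-encodes anyway.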
@yfarjoun Can you take a look and see if you agree with my interpretation of what's happening? How did we not see that before when we looked at this?
I agree, @lbergelson. However, I would like to note that the internedString map is also used for filters, reference names, and genotypes. For those objects, the current solution might be OK.
Hi @d-cameron,
Are you interested in continuing to work on this branch or would you like us to take it over for you?
@lbergelson happy for you to take it over. The change is relatively small, but the code review will be much faster if you only have to do it internally and not bounce back and forth to Australia for each revision. I have a few other projects on the go, and since I already have this patch packaged into GRIDSS, it's less urgent than the other work.
Thanks
Closing this because it's been replaced by #1282
Description
VCFCodec caches previously seen allele strings, presumably so that the millions of identical allele strings (e.g. A, AA, AT, T) are interned in the string HashMap<>. Unfortunately, when parsing a large VCF with SV alleles, this causes a memory leak.
Since each allele string is retained until the VCFCodec is GCed, streaming a VCF containing SVs (e.g. just counting the number of records) results in a continual increase in memory usage even when the VCF records are no longer in use. For SVs, the allele string can easily be thousands of bytes in size.
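The leak pattern described above can be reduced to a few lines. This is an illustrative sketch, not the actual VCFCodec code: a per-codec map that interns every allele string it sees stays tiny for SNV data, but every distinct multi-kilobase SV allele becomes a key that lives as long as the codec does.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the leaky pattern described above (not the real VCFCodec code):
// an unbounded per-codec cache that interns every allele string it sees.
// Each distinct SV allele string is retained until the codec is GCed,
// so streaming a large SV VCF grows memory without bound.
final class LeakyStringCache {
    private final Map<String, String> seen = new HashMap<>();

    String intern(String allele) {
        return seen.computeIfAbsent(allele, s -> s);
    }

    int retainedEntries() {
        return seen.size();
    }
}
```

Streaming N records that each carry a unique long allele leaves N large strings pinned in the map, even after every record has been discarded, which is exactly the behavior observed when just counting records in an SV VCF.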
Changes:
Checklist