samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Support very large sequence dictionaries/headers #1501

Closed nh13 closed 3 years ago

nh13 commented 4 years ago

This PR is meant to start a discussion; nitpicks will be ignored.

When we have very large sequence headers, tools will fail with OOM.

Quoting @jacarey's issue

My issue was in a readHeader. BufferedLineReader fromString reads into a byte array of len * max bytes per char for UTF8, which in my case was greater than 2^31 -1.

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:300)
        at java.lang.StringCoding.encode(StringCoding.java:344)
        at java.lang.StringCoding.encode(StringCoding.java:387)
        at java.lang.String.getBytes(String.java:958)
        at htsjdk.samtools.util.BufferedLineReader.fromString(BufferedLineReader.java:57)
        at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:667)
codecov-commenter commented 4 years ago

Codecov Report

Merging #1501 into master will decrease coverage by 0.022%. The diff coverage is 16.667%.

@@               Coverage Diff               @@
##              master     #1501       +/-   ##
===============================================
- Coverage     69.345%   69.323%   -0.022%     
+ Complexity      8893      8892        -1     
===============================================
  Files            601       601               
  Lines          35436     35447       +11     
  Branches        5900      5901        +1     
===============================================
  Hits           24573     24573               
- Misses          8534      8543        +9     
- Partials        2329      2331        +2     
Impacted Files Coverage Δ Complexity Δ
.../java/htsjdk/samtools/util/BufferedLineReader.java 46.667% <16.667%> (-21.754%) 6.000 <1.000> (ø)
...dk/samtools/util/SAMRecordPrefetchingIterator.java 74.667% <0.000%> (-1.333%) 13.000% <0.000%> (-1.000%)
codecov-io commented 3 years ago

Codecov Report

Merging #1501 (25cfe32) into master (e803eea) will decrease coverage by 0.022%. The diff coverage is 16.667%.

@@               Coverage Diff               @@
##              master     #1501       +/-   ##
===============================================
- Coverage     69.345%   69.323%   -0.022%     
+ Complexity      8893      8892        -1     
===============================================
  Files            601       601               
  Lines          35436     35447       +11     
  Branches        5900      5901        +1     
===============================================
  Hits           24573     24573               
- Misses          8534      8543        +9     
- Partials        2329      2331        +2     
Impacted Files Coverage Δ Complexity Δ
.../java/htsjdk/samtools/util/BufferedLineReader.java 46.667% <16.667%> (-21.754%) 6.000 <1.000> (ø)
...dk/samtools/util/SAMRecordPrefetchingIterator.java 74.667% <0.000%> (-1.333%) 13.000% <0.000%> (-1.000%)
nh13 commented 3 years ago

@lbergelson and @pshapiro4broad my apologies for this dropping off, but I've updated based on folks' review, can you take another look?

nh13 commented 3 years ago

@lbergelson can you merge?

nh13 commented 3 years ago

@pshapiro4broad or @lbergelson I think this is ready for merging. Any objection to merging?