samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Improve calculation of available memory when sizing buffers in SortingCollection #1437

Closed tfenne closed 4 years ago

tfenne commented 4 years ago

Description

SortingCollection keeps insisting that it doesn't have enough memory to buffer files for reading. But this is testably false - if you keep the same number of items in memory, same number of files, and increase the heap size, it continues to insist there is not enough space to buffer.

The existing implementation uses a method that returns the amount of free memory in the currently allocated heap. It does not take into account that the heap may not have been expanded to it's max size as specified by -Xmx.

Things to think about before submitting:

codecov-io commented 4 years ago

Codecov Report

Merging #1437 into master will increase coverage by 0.002%. The diff coverage is 100%.

@@              Coverage Diff               @@
##              master    #1437       +/-   ##
==============================================
+ Coverage     68.358%   68.36%   +0.002%     
  Complexity      8493     8493               
==============================================
  Files            583      583               
  Lines          34410    34412        +2     
  Branches        5729     5729               
==============================================
+ Hits           23522    23524        +2     
  Misses          8667     8667               
  Partials        2221     2221
Impacted Files Coverage Δ Complexity Δ
...n/java/htsjdk/samtools/util/SortingCollection.java 64.835% <100%> (+0.391%) 16 <0> (ø) :arrow_down:
tfenne commented 4 years ago

Hey @lbergelson. For your first question, I think that's really old javadoc that is true but misleading. My understanding is that most JVMs have a default maximum heap size that is not infinity or the physical memory on a machine. E.g. for me running oracle jdk8 with no -Xmx (and no JAVA_OPTS) I see a max memory of ~7.1GB (on a machine with 32GB of RAM).

Re: (2) my reading of that post and experience is that maxMemory() actually gives you an underestimate. I.e. you say -Xmx8g and maxMemory() reports 7.1GB.

Honestly I think this whole strategy of trying to estimate the available memory is questionable. Runtime.gc() is asynchronous, so even if you call it there's zero guarantee that space is reclaimed before calling other methods. Probably the correct solution here is to iteratively attempt to allocate the buffers, reducing size until it succeeds. But that's a bigger change and I'm just trying to make it so that I stop getting unbuffered readers all the time, which is really killing performance.