samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

Long reference sequences #1380

Open SergejN opened 5 years ago

SergejN commented 5 years ago

Dear developers,

I found a bug while working with very long sequences (Axolotl). FastaSequenceFile::readSequence increases the size of the internal buffer when the number of bases read so far equals the array size (line 177): if (sequenceLength == bases.length). Although this is a memory-efficient approach, it runs into problems if the sequence is even minimally longer than 2^30-1 bases: the method then tries to allocate an array with more than 2^31-1 elements, the int computation overflows, and the resulting array size is negative. I would suggest checking whether the current array size is 2^30 and, in that case, growing the internal array in smaller steps (say final byte[] tmp = new byte[(int)(bases.length*1.1)] instead of final byte[] tmp = new byte[bases.length*2]), or switching to a different data structure, which I imagine would be quite tedious. For now I have been able to work around the problem in my project as described above, but I admit it is probably not the best solution.
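To make the overflow concrete, here is a minimal sketch of the failing arithmetic and of one possible overflow-safe growth policy. It is not the htsjdk code: the class BufferGrowthSketch, its methods, and the MAX_ARRAY_LENGTH constant are hypothetical names introduced for illustration. Unlike the 1.1x workaround above, this variant keeps doubling but computes the new length in long and clamps it, on the assumption that the JVM rejects arrays within a few elements of Integer.MAX_VALUE.

```java
import java.util.Arrays;

// Hypothetical helper, NOT part of htsjdk: illustrates the int overflow
// and one possible capped growth policy for a byte[] buffer.
final class BufferGrowthSketch {

    // Assumed conservative limit; many JVMs cannot allocate arrays
    // of exactly Integer.MAX_VALUE elements.
    static final int MAX_ARRAY_LENGTH = Integer.MAX_VALUE - 8;

    // What "bases.length * 2" evaluates to in int arithmetic.
    static int doubledLength(int currentLength) {
        return currentLength * 2; // wraps to a negative value at 2^30
    }

    // Doubles in long arithmetic, then clamps to MAX_ARRAY_LENGTH.
    static int nextCapacity(int currentLength) {
        if (currentLength >= MAX_ARRAY_LENGTH) {
            throw new IllegalStateException(
                "Sequence does not fit into a single byte[]");
        }
        long doubled = Math.max(16L, 2L * currentLength);
        return (int) Math.min(doubled, (long) MAX_ARRAY_LENGTH);
    }

    // Returns a larger copy of the buffer, preserving its contents.
    static byte[] grow(byte[] bases) {
        return Arrays.copyOf(bases, nextCapacity(bases.length));
    }

    public static void main(String[] args) {
        System.out.println(doubledLength(1 << 30)); // prints -2147483648
        System.out.println(nextCapacity(1 << 30));  // prints 2147483639
    }
}
```

Passing the negative value from doubledLength to new byte[...] is what raises NegativeArraySizeException. Note that any single-array approach, including this one, still cannot hold a sequence longer than the array limit, which is where a different data structure would be needed.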

Thanks! Sergej