Closed arivers closed 6 years ago
Hmmm. This is interesting. @peterjc left a hint about avoiding recursion in the readline method that's causing this RecursionError:
def readline(self):
    """Read a single line for the BGZF file."""
    i = self._buffer.find(self._newline, self._within_block_offset)
    # Three cases to consider,
    if i == -1:
        # No newline, need to read in more data
        data = self._buffer[self._within_block_offset:]
        self._load_block()  # will reset offsets
        if not self._buffer:
            return data  # EOF
        else:
            # TODO - Avoid recursion
            return data + self.readline()
    elif i + 1 == len(self._buffer):
        # Found new line, but right at end of block (SPECIAL)
        data = self._buffer[self._within_block_offset:]
        # Must now load the next block to ensure tell() works
        self._load_block()  # will reset offsets
        assert data
        return data
    else:
        # Found new line, not at end of block (easy case, no IO)
        data = self._buffer[self._within_block_offset:i + 1]
        self._within_block_offset = i + 1
        # assert data.endswith(self._newline)
        return data
My guess, from looking at the bgzf code, is that your 299GB (!!!) file has really, really long lines, or maybe the file is corrupted at a line break.
Adding @KwatME since he raised this issue earlier as well.
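To put a rough number on "really, really long": in the readline shown above, every block that contains no newline costs one more recursive call. The sketch below estimates how long a single line could get before that fails. It assumes a BGZF block holds at most 64 KiB of uncompressed data and that Python's default recursion limit of 1000 is in effect. The helper name approx_max_line_bytes is made up for illustration, and the result is only an upper bound because frames already on the stack reduce it.

```python
import sys

# Assumption: a BGZF block holds at most 64 KiB of uncompressed data.
MAX_BLOCK_SIZE = 65536


def approx_max_line_bytes(recursion_limit=None, block_size=MAX_BLOCK_SIZE):
    """Rough upper bound on line length before a recursive readline fails.

    Each block without a newline adds one recursive call, so the bound is
    simply (number of allowed calls) * (bytes per block).
    """
    if recursion_limit is None:
        recursion_limit = sys.getrecursionlimit()
    return recursion_limit * block_size


if __name__ == "__main__":
    # With the default limit of 1000 this is on the order of tens of megabytes.
    print(approx_max_line_bytes(1000))
```

Under those assumptions, a chromosome-sized sequence stored on one line would be well past the bound, which would fit the guess above.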
From skimming that section of my code (Bio/bgzf.py), I would guess this file has a really, really long line, as @mdshw5 suggests. I take it this is a BGZF compressed FASTA file, so this could be a very long sequence like an entire chromosome with no line breaks?
It should be trivial to confirm (or reject) this guess, perhaps as simple as:
import gzip
handle = gzip.open('problem-file.gz')
for line in handle:
    if len(line) > 100:
        print(len(line))
handle.close()
Is there a public copy of this problem file? [Update - See my next comment]
Can you reduce this to a test case using just bgzf.py (and not pyfaidx as well)? If so we may want to log this as a bug over on Biopython.
Peter
I presume from the details on #125 that @KwatME was using this file (840Mb):
Updated #125 - was able to reproduce the issue there.
Can we close #131 as a duplicate of #125?
Yes I'll close this as it is a duplicate of #125
Thanks @mdshw5 and @peterjc for fixing this recursion error. I look forward to trying out the new improved pyfaidx.
Most of the credit goes to @rtf-const for refactoring my recursive functions into loops, but multiple people have helped too. We'd best get the new Biopython release out soon so you can all use this more easily.
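For anyone curious what "recursive functions into loops" can look like, here is a minimal sketch of the same three-case readline logic written with a while loop. It is not the actual Biopython code: ToyBlockReader is a hypothetical stand-in that serves blocks from an in-memory list, and only the attribute names (_buffer, _within_block_offset, _load_block) are borrowed from the snippet quoted earlier in this thread.

```python
class ToyBlockReader:
    """Hypothetical stand-in for a BGZF reader, fed from in-memory blocks."""

    def __init__(self, blocks, newline=b"\n"):
        self._blocks = list(blocks)
        self._newline = newline
        self._buffer = b""
        self._within_block_offset = 0
        self._load_block()

    def _load_block(self):
        # Replace the buffer with the next block (empty at EOF), reset offset.
        self._buffer = self._blocks.pop(0) if self._blocks else b""
        self._within_block_offset = 0

    def readline(self):
        """Read one line, looping over blocks instead of recursing."""
        parts = []
        while True:
            i = self._buffer.find(self._newline, self._within_block_offset)
            if i == -1:
                # No newline in the rest of this block: keep it, load more.
                parts.append(self._buffer[self._within_block_offset:])
                self._load_block()
                if not self._buffer:
                    break  # EOF
            elif i + 1 == len(self._buffer):
                # Newline is the last byte of the block: take the rest and
                # preload the next block so the offsets stay consistent.
                parts.append(self._buffer[self._within_block_offset:])
                self._load_block()
                break
            else:
                # Newline inside the block: no further loading needed.
                parts.append(self._buffer[self._within_block_offset:i + 1])
                self._within_block_offset = i + 1
                break
        return b"".join(parts)


if __name__ == "__main__":
    # One line spread over 5000 newline-free blocks, far past the default
    # recursion limit for the recursive version.
    reader = ToyBlockReader([b"A"] * 5000 + [b"\nsecond\n"])
    print(len(reader.readline()))
    print(reader.readline())
```

Collecting the pieces in a list and joining once at the end also avoids building the line through repeated concatenation, which matters when a single line spans many blocks.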
This issue relates to issue #125. When accessing a 299GB BGZF-compressed FASTA file created by Pbgzip, I encounter a recursion error that comes from the Biopython module bgzf.
My pyfaidx version is 0.5.1 from bioconda and my Biopython version is 1.70 installed from pip.