mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx
Other
449 stars 75 forks source link

Recursion error when with bfgz files #131

Closed arivers closed 6 years ago

arivers commented 6 years ago

This issue relates to issue #125. When accessing a 299GB bfgz fasta file created by Pbgzip, I encounter a recursion error that comes from the Biopython module bfgz.

  File "/global/homes/a/arrivers/.conda/envs/vicaenv/lib/python3.6/site-packages/pyfaidx/__init__.py", line 574, in from_file
    chunk_seq = self.file.read(chunk).decode()
  File "/global/homes/a/arrivers/.conda/envs/vicaenv/lib/python3.6/site-packages/Bio/bgzf.py", line 655, in read
    return data + self.read(size)
  File "/global/homes/a/arrivers/.conda/envs/vicaenv/lib/python3.6/site-packages/Bio/bgzf.py", line 655, in read
    return data + self.read(size)
  File "/global/homes/a/arrivers/.conda/envs/vicaenv/lib/python3.6/site-packages/Bio/bgzf.py", line 655, in read
    return data + self.read(size)
  [Previous line repeated 979 more times]
  File "/global/homes/a/arrivers/.conda/envs/vicaenv/lib/python3.6/site-packages/Bio/bgzf.py"0, line 648, in read
    self._load_block()  # will reset offsets
  File "/global/homes/a/arrivers/.conda/envs/vicaenv/lib/python3.6/site-packages/Bio/bgzf.py", line 577, in _load_block
    block_size, self._buffer = _load_bgzf_block(handle, self._text)
  File "/global/homes/a/arrivers/.conda/envs/vicaenv/lib/python3.6/site-packages/Bio/bgzf.py", line 412, in _load_bgzf_block
    if magic != _bgzf_magic:
RecursionError: maximum recursion depth exceeded in comparison

My Pyfaidx version is 0.5.1 from bioconda and my Biopython version is 1.70 installed from pip.

mdshw5 commented 6 years ago

Hmmm. This is interesting. @peterjc left a hint about avoiding recursion in the readline method that's causing this RecursionError:

    def readline(self):
        """Read a single line for the BGZF file."""
        i = self._buffer.find(self._newline, self._within_block_offset)
        # Three cases to consider,
        if i == -1:
            # No newline, need to read in more data
            data = self._buffer[self._within_block_offset:]
            self._load_block()  # will reset offsets
            if not self._buffer:
                return data  # EOF
            else:
                # TODO - Avoid recursion
                return data + self.readline()
        elif i + 1 == len(self._buffer):
            # Found new line, but right at end of block (SPECIAL)
            data = self._buffer[self._within_block_offset:]
            # Must now load the next block to ensure tell() works
            self._load_block()  # will reset offsets
            assert data
            return data
        else:
            # Found new line, not at end of block (easy case, no IO)
            data = self._buffer[self._within_block_offset:i + 1]
            self._within_block_offset = i + 1
            # assert data.endswith(self._newline)
            return data

My guess, from looking at the bgzf code, is that your 299GB (!!!) file has really, really long lines, or maybe the file is corrupted at a line break.

mdshw5 commented 6 years ago

Adding @KwatME since he raised this issue earlier as well.

peterjc commented 6 years ago

From skimming that section of my code (Bio/bgzf.py), I would guess this file has a really really long line as @mdshw5 suggests. I take it this is a BGZF compressed FASTA file, so this could be a very long sequence like an entire chromosome with no line breaks?

It should be trivial to confirm (or reject) this guess, perhaps as simple as:

import gzip
handle = gzip.open('problem-file.gz')
for line in handle:
    if len(line) > 100: 
        print(len(line))
handle.close()

Is there a public copy of this problem file? [Update - See my next comment]

Can you reduce this to a test case using just bgzf.py (and not pyfaidx as well)? If so we may want to log this as a bug over on Biopython.

Peter

peterjc commented 6 years ago

I presume from the details on #125 that @KwatME was using this file (840Mb):

http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

mdshw5 commented 6 years ago

Updated #125 - was able to reproduce the issue there.

peterjc commented 6 years ago

Can we close #131 as a duplicate of #125?

mdshw5 commented 6 years ago

Yes I'll close this as it is a duplicate of #125

arivers commented 5 years ago

Thanks @mdshw5 and @peterjc for fixing this recursion error. I look forward to trying out the new improved pyfaidx.

peterjc commented 5 years ago

Most of the credit goes to @rtf-const for refactoring my recursive functions into loops, but multiple people have helped too. We'd best get the new Biopython release out soon so you can all use this more easily.