mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx

BGZF recursion issue #125

Closed KwatMDPhD closed 6 years ago

KwatMDPhD commented 7 years ago

Hi,

I used Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz and queried pyfaidx_handle['12'][111803961:111803962]. The runtime was very slow, and on further investigation I found I was hitting a maximum recursion error. I think using an uncompressed .fa is faster and safer for now. Thoughts?

Kind regards, Kwat

mdshw5 commented 6 years ago

Sorry for ignoring this issue - it slipped through the cracks! I'm going to close it and add you to the other similar issue #131.

mdshw5 commented 6 years ago

Just to double-check: did you compress this file with bgzip? There should be a check for the BGZF magic and you should get a pyfaidx.UnsupportedCompressionFormat error on a normal gzip file, but I just want to be thorough.

mdshw5 commented 6 years ago

I'm definitely able to reproduce your issue. (thanks @peterjc for pointing out the source file)

$ wget http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
--2018-01-06 20:21:07--  http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Resolving ftp.ensembl.org... 193.62.193.8
Connecting to ftp.ensembl.org|193.62.193.8|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 881214396 (840M) [application/x-gzip]
Saving to: 'Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz'

Homo_sapiens.GRCh38.dna.primary_assemb 100%[============================================================================>] 840.39M   393KB/s    in 24m 34s 

2018-01-06 20:45:41 (584 KB/s) - 'Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz' saved [881214396/881214396]

$ gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz 
$ bgzip Homo_sapiens.GRCh38.dna.primary_assembly.fa 
$ python
Python 2.7.10 (default, Feb  7 2017, 00:08:15) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyfaidx import Fasta
>>> pyfaidx_handle = Fasta('Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz')
>>> pyfaidx_handle['12'][111803961:111803962]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 698, in __getitem__
    return self._fa.get_seq(self.name, start + 1, stop)[::step]
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 887, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 532, in fetch
    seq = self.from_file(name, start, end)
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 574, in from_file
    chunk_seq = self.file.read(chunk).decode()
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 651, in read
    return data + self.read(size)
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 651, in read
    return data + self.read(size)
...
...
...

yada, yada, yada ...

 File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 651, in read
    return data + self.read(size)
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 644, in read
    self._load_block()  # will reset offsets
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 576, in _load_block
    block_size, self._buffer = _load_bgzf_block(handle, self._text)
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 411, in _load_bgzf_block
    if magic != _bgzf_magic:
RuntimeError: maximum recursion depth exceeded in cmp
>>> 

It appears that the line lengths are all 61 bytes, so this isn't a long line issue. I'll add some debug flags to pyfaidx and see if maybe it's a bug in how I'm indexing the file and calculating offsets. @peterjc: what would happen if I tried to read from a virtual offset that doesn't exist (out of bounds read)?

mdshw5 commented 6 years ago

Specifically I'm indexing a BGZF compressed file by calling the tell method to get the current offset into the beginning of each FASTA record:

https://github.com/mdshw5/pyfaidx/blob/8168b375f2131485888b5f554e0aac976367ee11/pyfaidx/__init__.py#L476

This returns a virtual offset, which I then pass to the BGZF seek method: I seek to the known virtual offset of the record start, read through to the end position, and slice off the beginning before returning (I can't seek directly to the requested start coordinate because it might be in a different BGZF block).

https://github.com/mdshw5/pyfaidx/blob/8168b375f2131485888b5f554e0aac976367ee11/pyfaidx/__init__.py#L570-L575

Is there something obviously wrong with this?
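For illustration, here's a minimal sketch of that fetch pattern using a plain `io.BytesIO` in place of the BGZF handle (a `BgzfReader` exposes the same `tell`/`seek`/`read` interface, but its `tell()` returns a virtual offset, which is exactly why the record-start offset has to come straight from `tell()` during indexing rather than from arithmetic). The `fetch` helper and the `line_len`/`line_blen` parameters are hypothetical, not pyfaidx's actual code:

```python
import io

# Stand-in for the BGZF handle; BgzfReader has the same interface, but its
# tell() returns a virtual offset (block start << 16 | within-block offset),
# so plain arithmetic on offsets is not valid for the compressed case.
data = b">chr1\nACGTACGTAC\nGTACGTACGT\n"
handle = io.BytesIO(data)

# Indexing step: consume the header, then record the offset of the first
# sequence byte via tell(). This value round-trips safely through seek().
assert handle.readline() == b">chr1\n"
record_start = handle.tell()

def fetch(handle, record_start, start, end, line_len=10, line_blen=11):
    """Fetch sequence [start:end] (0-based) by seeking to the record start,
    reading through the end position, and slicing off the unwanted prefix."""
    # Bytes needed to cover `end` sequence characters, newlines included.
    end_blen = (end // line_len) * line_blen + end % line_len
    handle.seek(record_start)
    raw = handle.read(end_blen).decode()
    seq = raw.replace('\n', '')
    return seq[start:end]

print(fetch(handle, record_start, 12, 16))  # ACGT
```

The cost of this scheme is that every fetch reads from the record start, which is why substrings near the end of a long chromosome are expensive.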

mdshw5 commented 6 years ago

Ah, I think I know what's happening here. I'm calling BgzfReader.read() and asking for all the data between chromosome 12 positions [1:111803962]. I do this to avoid adding/subtracting virtual offsets in my Faidx.fetch() method. I think BgzfReader will hit the interpreter recursion limit any time you ask it to read a large string spanning a large number of BGZF blocks, as in this example.

It's not trivial, but I could try to come up with a way to be more efficient with my BGZF reads. @peterjc: maybe I'm missing something, but it seems impossible to efficiently seek to and read a specific substring that may span multiple BGZF blocks unless you store the offset of each block, indexed to the string.
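To see why the recursion limit bites here, a toy model (not the real Bio/bgzf.py code) contrasting the recursive read pattern with the iterative loop that the eventual fix uses:

```python
# Toy model of a block-compressed stream: each "block" yields at most BLOCK
# characters per call. The recursive version mirrors the old Bio/bgzf.py
# pattern (one call frame per block), so a multi-megabyte request over
# real ~64 KB BGZF blocks easily exceeds Python's default recursion limit
# of 1000. The iterative version is the shape of the eventual fix.
BLOCK = 4

def read_recursive(buf, size, pos=0):
    if size <= 0 or pos >= len(buf):
        return ""
    chunk = buf[pos:pos + min(size, BLOCK)]
    return chunk + read_recursive(buf, size - len(chunk), pos + len(chunk))

def read_iterative(buf, size, pos=0):
    parts = []
    while size > 0 and pos < len(buf):
        chunk = buf[pos:pos + min(size, BLOCK)]
        parts.append(chunk)
        size -= len(chunk)
        pos += len(chunk)
    return "".join(parts)

data = "A" * 10000
assert read_iterative(data, 10000) == data
# read_recursive(data, 10000) would need 2500 nested frames here -- which is
# why chromosome-scale reads through BgzfReader raise RecursionError.
```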

peterjc commented 6 years ago

Does this mean you can print out the problem offset, so we can make a tiny test case using this human FASTA file and just Bio/bgzf.py? That'd be very helpful, as right now (continuing our discussion from #131, and without having played with the code) I am puzzled as to what might be going wrong here.

Note I suspect both .readline() and .read() may be affected by this recursion issue (I left a TODO comment in both methods).

@mdshw5 do you have the length in bytes of the string you want to read, as well as the (BGZF virtual) offset of its start point? That path is quite well tested, because it's how Biopython's SeqIO get_raw functionality works on BGZF compressed files. However, it did take some careful coding to ensure I could use .tell() to get the virtual offset needed while building the index (i.e. avoiding any virtual pointer arithmetic, because that is not supported).

UnkindPartition commented 6 years ago

FWIW, I also ran into the same problem today, it seems.

% faidx -o "xxx.fa" -b "xxx.narrowPeak" "hg38.fa.gz"
Traceback (most recent call last):
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/bin/faidx", line 10, in <module>
    sys.exit(main())
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/cli.py", line 197, in main
    write_sequence(args)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/cli.py", line 52, in write_sequence
    for line in fetch_sequence(args, fasta, name, start, end):
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/cli.py", line 69, in fetch_sequence
    sequence = fasta[name][start:end]
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 789, in __getitem__
    return self._fa.get_seq(self.name, start + 1, stop)[::step]
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 1015, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 613, in fetch
    seq = self.from_file(name, start, end)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 656, in from_file
    chunk_seq = self.file.read(chunk).decode()
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 655, in read
    return data + self.read(size)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 655, in read
    return data + self.read(size)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 655, in read
    return data + self.read(size)
  [Previous line repeated 983 more times]
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 648, in read
    self._load_block()  # will reset offsets
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 577, in _load_block
    block_size, self._buffer = _load_bgzf_block(handle, self._text)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 412, in _load_bgzf_block
    if magic != _bgzf_magic:
RecursionError: maximum recursion depth exceeded in comparison

mdshw5 commented 6 years ago

I can confirm that the solution in https://github.com/biopython/biopython/issues/1701 fixes this issue. Due to my implementation, there is still a large performance penalty for fetching small substrings near the end of a record; I'll open an issue to remind myself to explore a solution.

peterjc commented 6 years ago

Great. You'll be looking forward to Biopython 1.73 then, which will be the first release to include this fix.

Thanks everyone 👍

mdshw5 commented 6 years ago

I'll add a version check around the BGZF code then.
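A sketch of what such a guard might look like, assuming the fix first ships in Biopython 1.73 (the helper name is hypothetical, not pyfaidx's actual code):

```python
def bgzf_supported(version_string):
    """Return True if this Biopython version includes the iterative
    BgzfReader.read, assumed to first ship in release 1.73.

    Biopython exposes its version as e.g. Bio.__version__ == "1.73";
    comparing as an integer tuple avoids lexicographic surprises
    (e.g. "1.9" vs "1.73" as strings).
    """
    major, minor = (int(x) for x in version_string.split(".")[:2])
    return (major, minor) >= (1, 73)

print(bgzf_supported("1.72"))  # False
print(bgzf_supported("1.73"))  # True
```

At import time the BGZF code path could check `bgzf_supported(Bio.__version__)` and raise an informative error on older installs instead of letting users hit the RecursionError.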

KwatMDPhD commented 5 years ago

:)