KwatMDPhD closed this issue 6 years ago.
Sorry for ignoring this issue - it slipped through the cracks! I'm going to close it and add you to the other similar issue #131.
Just to double-check: did you compress this file with `bgzip`? There should be a check for the BGZF magic, and you should get a `pyfaidx.UnsupportedCompressionFormat` error on a normal gzip file, but I just want to be thorough.
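For reference, BGZF files are still valid gzip; what distinguishes them is the FEXTRA flag and a "BC" extra subfield in every block header (per the SAM/BGZF spec). A detection check of this shape (a sketch for illustration, not pyfaidx's actual code) might look like:

```python
# Sketch: distinguish BGZF from plain gzip by checking the gzip FEXTRA
# flag and the 'BC' extra subfield required by the BGZF spec.
import struct

def looks_like_bgzf(first_bytes: bytes) -> bool:
    """Return True if the first block header carries the BGZF 'BC' subfield."""
    if len(first_bytes) < 18:
        return False
    # gzip magic, deflate method, FEXTRA flag set
    if first_bytes[:4] != b"\x1f\x8b\x08\x04":
        return False
    xlen = struct.unpack("<H", first_bytes[10:12])[0]
    extra = first_bytes[12:12 + xlen]
    # walk the extra subfields looking for SI1='B', SI2='C'
    while len(extra) >= 4:
        si1, si2 = extra[0], extra[1]
        slen = struct.unpack("<H", extra[2:4])[0]
        if si1 == ord("B") and si2 == ord("C"):
            return True
        extra = extra[4 + slen:]
    return False
```

A plain `gzip` file normally has FLG = 0 (no FEXTRA), so it fails the header check immediately.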
I'm definitely able to reproduce your issue. (thanks @peterjc for pointing out the source file)
```
$ wget http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
--2018-01-06 20:21:07-- http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Resolving ftp.ensembl.org... 193.62.193.8
Connecting to ftp.ensembl.org|193.62.193.8|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 881214396 (840M) [application/x-gzip]
Saving to: 'Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz'
Homo_sapiens.GRCh38.dna.primary_assemb 100%[============================================================================>] 840.39M 393KB/s in 24m 34s
2018-01-06 20:45:41 (584 KB/s) - 'Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz' saved [881214396/881214396]
$ gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
$ bgzip Homo_sapiens.GRCh38.dna.primary_assembly.fa
$ python
Python 2.7.10 (default, Feb 7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyfaidx import Fasta
>>> pyfaidx_handle = Fasta('Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz')
>>> pyfaidx_handle['12'][111803961:111803962]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 698, in __getitem__
    return self._fa.get_seq(self.name, start + 1, stop)[::step]
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 887, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 532, in fetch
    seq = self.from_file(name, start, end)
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 574, in from_file
    chunk_seq = self.file.read(chunk).decode()
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 651, in read
    return data + self.read(size)
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 651, in read
    return data + self.read(size)
  ...
  ...
  ...
  yada, yada, yada ...
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 651, in read
    return data + self.read(size)
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 644, in read
    self._load_block() # will reset offsets
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 576, in _load_block
    block_size, self._buffer = _load_bgzf_block(handle, self._text)
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 411, in _load_bgzf_block
    if magic != _bgzf_magic:
RuntimeError: maximum recursion depth exceeded in cmp
>>>
```
It appears that the line lengths are all 61 bytes, so this isn't a long-line issue. I'll add some debug flags to pyfaidx and see if maybe it's a bug in how I'm indexing the file and calculating offsets. @peterjc: what would happen if I tried to read from a virtual offset that doesn't exist (an out-of-bounds read)?

Specifically, I'm indexing a BGZF-compressed file by calling the `tell` method to get the virtual offset at the beginning of each FASTA record. I then pass that virtual offset to the BGZF `seek` method, read from the known record start until the end position, and slice off the unwanted prefix before returning (I can't seek directly to the requested start coordinate because it might be in a different BGZF block). Is there something obviously wrong with this?
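For anyone following along, the fetch strategy described above can be sketched like this, with `io.BytesIO` standing in for the BGZF handle (same `seek()`/`read()` surface; `fetch`, `linebases`, and `linewidth` are illustrative names, not pyfaidx's actual API):

```python
# Sketch of the fetch strategy: seek only to the record start, read enough
# bytes (including newline padding) to cover the end coordinate, then slice.
import io

def fetch(handle, record_offset, start, end, linebases=60, linewidth=61):
    """Return sequence[start:end] (0-based) for the record at record_offset."""
    handle.seek(record_offset)
    # bytes needed to reach base `end`, assuming one newline per full line
    byte_len = end // linebases * linewidth + end % linebases
    chunk = handle.read(byte_len).decode()
    seq = chunk.replace("\n", "")
    return seq[start:end]

# Toy record: three newline-terminated lines of 60 bases each
record = ("A" * 60 + "\n" + "C" * 60 + "\n" + "G" * 60 + "\n").encode()
handle = io.BytesIO(record)
print(fetch(handle, 0, 59, 62))  # "ACC" -- the slice spans a line break
```

The cost of this approach is that every byte from the record start up to `end` is read and decoded, even when only a tiny slice near the end is wanted.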
Ah, I think I know what's happening here. I'm calling `BgzfReader.read()` and asking for all of the data between chromosome 12 positions [1:111803962]. This is due to the way I've chosen to avoid adding/subtracting virtual offsets in my `Faidx.fetch()` method. I think `BgzfReader` will hit the interpreter recursion limit any time you ask it to read a large string from a file with a large number of newline characters, such as this example.

It's not trivial, but I could try to come up with a way to make my BGZF reads more efficient. @peterjc: maybe I'm missing something, but it seems impossible to efficiently seek to and read a specific substring that may span multiple BGZF blocks unless you store the offset of each block, indexed against the uncompressed string.
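The block-index idea at the end could be sketched as follows (purely hypothetical; `nearest_block` and the index layout are made up for illustration): keep, for each BGZF block, the uncompressed offset at which it starts alongside its virtual offset, and bisect to seek near the target instead of back to the record start.

```python
# Hypothetical block index: sorted (uncompressed_offset, virtual_offset)
# pairs, one per BGZF block, searched with bisect.
import bisect

def nearest_block(block_index, target):
    """Return the (uncompressed_offset, virtual_offset) pair of the block
    containing uncompressed position `target`."""
    starts = [u for u, _ in block_index]
    i = bisect.bisect_right(starts, target) - 1
    return block_index[max(i, 0)]

# Toy index: blocks of 65280 uncompressed bytes (a typical BGZF payload size)
index = [(0, 0), (65280, 7 << 16), (130560, 15 << 16)]
print(nearest_block(index, 70000))  # lands in the second block
```

With such an index, a fetch would seek to the block containing `start` and skip only the within-block remainder, rather than reading from the record start.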
Does this mean you can print out the problem offset so we can make a tiny test case using this human FASTA file and just `Bio/bgzf.py`? That would be very helpful. Right now (continuing our discussion from #131, without having played with the code), I am puzzled as to what might be going wrong here.
Note I suspect both `.readline()` and `.read()` may be affected by this recursion issue (I left a TODO comment in both methods).
@mdshw5: do you have the length in bytes of the string you want to read, as well as the (BGZF virtual) offset of its start point? That path is quite well tested, because that's how Biopython's SeqIO `get_raw` functionality works on BGZF-compressed files. However, it did take some careful coding to ensure I could use `.tell()` to get the virtual offset needed while building the index (i.e. avoiding any virtual pointer arithmetic, because that is not supported).
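For context, a BGZF virtual offset packs two numbers into one 64-bit integer: the compressed file offset of the block start in the upper 48 bits, and the uncompressed position within that block in the lower 16 bits. `Bio.bgzf` provides helpers of this shape; here is a plain-Python sketch:

```python
# Sketch of BGZF virtual offsets: (block_start << 16) | within_block.
def make_virtual_offset(block_start, within_block):
    assert 0 <= within_block < 65536, "within-block offset is 16 bits"
    return (block_start << 16) | within_block

def split_virtual_offset(voffset):
    return voffset >> 16, voffset & 0xFFFF

v = make_virtual_offset(59307, 42)
print(split_virtual_offset(v))  # (59307, 42)
```

This packing is also why ordinary arithmetic on virtual offsets is unsupported: adding 1 moves you one byte within the current block's uncompressed data, but cannot carry across a block boundary, since the upper bits are compressed-file positions.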
FWIW, I also ran into the same problem today, it seems.
```
% faidx -o "xxx.fa" -b "xxx.narrowPeak" "hg38.fa.gz"
Traceback (most recent call last):
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/bin/faidx", line 10, in <module>
    sys.exit(main())
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/cli.py", line 197, in main
    write_sequence(args)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/cli.py", line 52, in write_sequence
    for line in fetch_sequence(args, fasta, name, start, end):
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/cli.py", line 69, in fetch_sequence
    sequence = fasta[name][start:end]
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 789, in __getitem__
    return self._fa.get_seq(self.name, start + 1, stop)[::step]
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 1015, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 613, in fetch
    seq = self.from_file(name, start, end)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 656, in from_file
    chunk_seq = self.file.read(chunk).decode()
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 655, in read
    return data + self.read(size)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 655, in read
    return data + self.read(size)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 655, in read
    return data + self.read(size)
  [Previous line repeated 983 more times]
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 648, in read
    self._load_block() # will reset offsets
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 577, in _load_block
    block_size, self._buffer = _load_bgzf_block(handle, self._text)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 412, in _load_bgzf_block
    if magic != _bgzf_magic:
RecursionError: maximum recursion depth exceeded in comparison
```
I can confirm that the fix in https://github.com/biopython/biopython/issues/1701 resolves this issue. Due to my implementation there is still a large performance penalty for fetching small substrings near the end of a record, so I'll open an issue to remind myself to explore a solution.
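As I understand the fix, the recursive chunk concatenation in `BgzfReader.read()` was replaced with a loop; reading a span that crosses many blocks then costs one loop iteration per block instead of one stack frame per block. A toy model of why that matters (these two functions are illustrative, not Biopython's code):

```python
# Toy model: recursive vs iterative multi-block reads. One "block" here is
# just a bytes chunk; the real reader loads a BGZF block per step.
def read_recursive(blocks, size, i=0):
    if size <= 0 or i >= len(blocks):
        return b""
    data = blocks[i][:size]
    return data + read_recursive(blocks, size - len(data), i + 1)  # one frame per block

def read_iterative(blocks, size):
    parts = []
    i = 0
    while size > 0 and i < len(blocks):
        data = blocks[i][:size]
        parts.append(data)
        size -= len(data)
        i += 1
    return b"".join(parts)

blocks = [b"x" * 10] * 5000  # 5000 blocks: deeper than CPython's default limit
assert read_iterative(blocks, 50000) == b"x" * 50000
# read_recursive(blocks, 50000) would raise RecursionError on a default
# (1000-frame) Python stack, which is the failure mode in the tracebacks above.
```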
Great. You'll be looking forward to Biopython 1.73 then, which will be the first release to include this fix.
Thanks everyone 👍
I'll add a version check around the BGZF code then.
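A version check of that sort might look like this (a hypothetical sketch; pyfaidx's actual check and function names may differ):

```python
# Hypothetical sketch: only take the recursion-safe BGZF path when the
# installed Biopython is 1.73 or later (the first release with the fix).
def biopython_supports_bgzf(version_string):
    """True if `version_string` names Biopython 1.73 or later."""
    parts = []
    for field in version_string.split("."):
        digits = "".join(c for c in field if c.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts) >= (1, 73)

print(biopython_supports_bgzf("1.73"))  # True
print(biopython_supports_bgzf("1.72"))  # False
```

In practice this would be fed `Bio.__version__`; note the naive digit-stripping lets pre-release strings like "1.73b" pass as 1.73.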
:)
Hi,

I used Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz and queried `pyfaidx_handle['12'][111803961:111803962]`. But the runtime is very slow, and with further investigation I learned that I was getting a maximum recursion error. I think using an uncompressed `.fa` is faster and safer as of now. Thoughts?

Kind regards, Kwat