mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx

ImportError: BioPython >= 1.73 must be installed to read block gzip files. #147

Closed fungs closed 5 years ago

fungs commented 5 years ago

I'm trying to access a bgzipped FASTA file. Although the message suggests that BioPython version 1.73 exists, this release has not been published yet. Is this the expected behavior?

mdshw5 commented 5 years ago

Yeah, this is expected. The 1.73 release will have a fix for #125, but there is a workaround. I was waiting on the 1.73 release, but since biopython is a large project, @peterjc needs quite a bit of time and testing between releases. You can use the current development biopython version, installable with:

pip install -e git+https://github.com/biopython/biopython.git#egg=biopython

You could also use an older pyfaidx version (https://github.com/mdshw5/pyfaidx/releases/tag/v0.5.4.2), but you may run into a RecursionError using the current biopython BGZF code.

The preferred solution is to wait (a few days/weeks) for the biopython 1.73 release, which seems imminent.

peterjc commented 5 years ago

Imminent yes, but unlikely to be this week as I'm off site for meetings until Friday.

fungs commented 5 years ago

It's ok for me, I will give it a try with the dev version of BioPython. The error message is confusing though, because I was trying to update BioPython through Conda and pip and wondered why I was always getting an old version :)

A related question, since you seem to be the right guys to ask: FASTA+FAI support seems to be widespread, but is there other software implementing the BGZIP version for random access to compressed FASTA? I was hoping for compatible implementations in, for instance, SeqAn, but found nothing.

mdshw5 commented 5 years ago

The only Python API for BGZF-compressed and indexed FASTA files that I know of is pysam. Since it wraps the reference BGZF code in htslib, you can fetch regions using its FastaFile object. It doesn't appear that indexing is exposed through pysam, so you must use samtools to generate an index first. I'll also point out that pyfaidx does not generate or use the .gzi file, which htslib (and pysam) require, and I would like to incorporate this information in the future for more efficient (I think?) queries. If anyone wants to help, see #126.
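
For background on what the index step produces: as I understand the faidx format, each .fai row records a sequence's name, its length, the byte offset of its first base, the bases per line, and the bytes per line, which is enough to turn a coordinate into a file offset. A minimal sketch for an uncompressed FASTA in plain Python follows. `build_fai` and `fetch` are hypothetical helper names, not pyfaidx or pysam functions, and the sketch assumes every line within a record (except the last) has the same length:

```python
import io


def build_fai(fasta_bytes):
    """Return {name: (length, offset, linebases, linewidth)}.

    Hypothetical helper for illustration; mirrors the five .fai columns
    for an uncompressed FASTA held in memory.
    """
    index = {}
    name = None
    pos = 0  # byte position of the start of the current line
    for line in io.BytesIO(fasta_bytes):
        if line.startswith(b">"):
            # Sequence name is the header text up to the first whitespace.
            name = line[1:].split()[0].decode()
            # The offset of the first base is the byte right after the header line.
            index[name] = [0, pos + len(line), 0, 0]
        elif name is not None and line.strip():
            entry = index[name]
            bases = len(line.rstrip(b"\r\n"))
            if entry[2] == 0:
                # First sequence line defines bases per line and bytes per line.
                entry[2], entry[3] = bases, len(line)
            entry[0] += bases
        pos += len(line)
    return {key: tuple(value) for key, value in index.items()}


def fetch(fasta_bytes, index, name, start, end):
    """Return bases [start, end) of `name` (0-based, end-exclusive)."""
    length, offset, linebases, linewidth = index[name]
    end = min(end, length)

    def to_byte(position):
        # Whole lines before the position, plus the remainder within the line.
        return offset + (position // linebases) * linewidth + position % linebases

    raw = fasta_bytes[to_byte(start):to_byte(end)]
    # Drop the line terminators that fall inside the slice.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()
```

With an on-disk file the same arithmetic would drive a `seek()` and `read()` instead of a slice. For a bgzipped FASTA the computed offset refers to the uncompressed stream, so it still has to be mapped onto compressed blocks.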

peterjc commented 5 years ago

Biopython supports random access to BGZF compressed sequence files including FASTA (for pulling out entire sequence records), but does not currently support the FAI index format. See Bio.SeqIO.index (in memory index) and Bio.SeqIO.index_db (reusable SQLite3 based index on disk). See also https://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
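
To illustrate why BGZF allows random access at all: as I read the format in the SAM specification, a BGZF file is a series of small gzip members, each carrying its own compressed length in a `BC` extra field, and a position is addressed by a virtual offset (the block's start in the file shifted left 16 bits, OR'd with the offset inside the decompressed block). A rough standard-library sketch under those assumptions follows. `make_bgzf_block`, `read_block`, and `read_at` are made-up names, not Biopython's API, and the empty EOF block that real files end with is omitted here:

```python
import struct
import zlib


def make_bgzf_block(data):
    """Compress `data` (well under 64 KiB) into a single BGZF-style block."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no zlib header
    cdata = comp.compress(data) + comp.flush()
    bsize = len(cdata) + 25  # total block length minus 1
    # gzip header: magic, CM=8 (deflate), FLG=4 (extra field), MTIME, XFL, OS, XLEN=6
    header = struct.pack("<BBBBIBBH", 31, 139, 8, 4, 0, 0, 255, 6)
    extra = struct.pack("<BBHH", 66, 67, 2, bsize)  # 'B', 'C', SLEN=2, BSIZE
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + extra + cdata + trailer


def read_block(buf, start):
    """Decompress the block starting at byte `start`; return (data, next_start)."""
    id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack_from(
        "<BBBBIBBH", buf, start
    )
    if (id1, id2, cm) != (31, 139, 8) or not flg & 4:
        raise ValueError("not a BGZF block")
    extra = buf[start + 12:start + 12 + xlen]
    bsize = None
    pos = 0
    while pos + 4 <= len(extra):
        # Walk the extra subfields until the 'BC' one holding the block size.
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (66, 67, 2):
            bsize = struct.unpack_from("<H", extra, pos + 4)[0]
        pos += 4 + slen
    if bsize is None:
        raise ValueError("BC subfield missing")
    block_len = bsize + 1
    # Compressed payload sits between the extra field and the 8-byte trailer.
    cdata = buf[start + 12 + xlen:start + block_len - 8]
    return zlib.decompress(cdata, -15), start + block_len


def read_at(buf, virtual_offset, size):
    """Read `size` decompressed bytes starting at a BGZF virtual offset."""
    block_start = virtual_offset >> 16
    within = virtual_offset & 0xFFFF
    out = b""
    while len(out) < size and block_start < len(buf):
        data, block_start = read_block(buf, block_start)
        out += data[within:]
        within = 0  # later blocks are consumed from their beginning
    return out[:size]
```

Only the blocks that overlap the request are decompressed, which is the property an index over a bgzipped FASTA relies on. Because each block is an ordinary gzip member, the concatenated blocks can also be read sequentially by standard gzip tools.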

mdshw5 commented 5 years ago

I’m closing this issue as biopython 1.73 has been released.