mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx
Other
459 stars 75 forks source link

bgzipped file does not work #124

Closed KwatMDPhD closed 5 years ago

KwatMDPhD commented 7 years ago

I downloaded ftp://ftp.ensembl.org/pub/release-89/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz.

I converted gzip to bgzip:

$ gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
$ bgzip Homo_sapiens.GRCh38.dna.primary_assembly.fa

Then I used pyfaidx:

import pyfaidx
fasfa_handle= pyfaidx.Fasta('Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz')
fasfa_handle['1'][10000:10001].seq

Finally, I got an error:

ValueError: A BGZF (e.g. a BAM file) block should start with b'\x1f\x8b\x08\x04', not b'\x16]e\x13'; handle.tell() now says 10104

I can fork the repo and disable the bgzf support. What do you think?

mdshw5 commented 7 years ago

This is just a corrupted BGZF file. Are you sure that bgzip actually over-wrote the existing file named Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz? As far as forking the repo, I'm not sure what you're wanting to do, but BGZF support does in fact work and has a whole test suite (https://github.com/mdshw5/pyfaidx/blob/master/tests/test_Fasta_bgzip.py) that's passing, and raises errors for class methods that are not supported when using BGZF. I'll try to reproduce the error you're seeing, and if so go from there.

KwatMDPhD commented 7 years ago

I just tested again and you're right that BGZF works.

screen shot 2017-09-13 at 01 36 46

Thanks for the message. Perhaps the warnings for BGZF can be dropped?

mdshw5 commented 7 years ago

Yeah I pulled your changes and version 5.0.1 does not raise this warning.

KwatMDPhD commented 7 years ago

I just figured out the source of my error. When I do samtools faidx file.fa.gz and then try to use the same file.fa.gz file for pyfaidx, then I get the bug I described.

davmlaw commented 5 years ago

I just figured out the source of my error. When I do samtools faidx file.fa.gz and then try to use the same file.fa.gz file for pyfaidx, then I get the bug I described.

FYI I had the same issue, and to fix it I did:

touch Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
faidx Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz 11:1-1 # create new index and return garbage coordinate
KwatMDPhD commented 5 years ago

I stopped using pyfaidx. I now call command line samtools and parse its output, a more general approach.

davmlaw commented 5 years ago

pysam has FastaFile which has a fetch method which seems to do the trick, and avoids the process calling overhead of calling the command line tool.

I came here from PyFasta which required me to duplicate references and now says to use pyfaidx.

KwatMDPhD commented 5 years ago

I see. Also, if you do change your programming language, you have to find a new PyFasta or any library replacement. I had to go through this bitterness and ended up using the purest command line samtools and parsing its result.

davmlaw commented 5 years ago

I think you accidentally reopened this issue. It's fixed