mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx
Other
459 stars 75 forks source link

Create or load htslib .fai and .gzi index files when using BGZF files #126

Closed KwatMDPhD closed 3 years ago

KwatMDPhD commented 7 years ago

Hi,

When I do samtools faidx file.fa.gz and then try to use the same file.fa.gz file for pyfaidx, I get an error saying that file.fa.gz is not a valid BGZF file. But when I delete the file.fa.gz.fai and then use pyfaidx, the error disappears. I believe this is because the .fai pyfaidx creates is different from .fai samtools creates.

If this behavior is real, Is it possible to unify the .fai of pyfaidx and samtools? Thoughts?

Kind regards, Kwat

KwatMDPhD commented 7 years ago

Follow up: samtools faidx fails using the index created from pyfaidx as well.

screen shot 2017-09-13 at 05 04 14
mdshw5 commented 7 years ago

I think this is a good idea, and the work to support this is:

  1. Store compressed file offsets, which is what it looks like samtools is storing in its .fai, instead of virtual offsets, which I'm doing in this library
  2. Write methods to build and load the .gzi index file that htslib creates, which stores the number of gzipped blocks, and (compressed offset, uncompressed offset) for each block start byte (currently described in https://github.com/samtools/htslib/issues/473)
  3. Change Faidx logic to use the .fai and .gzi indices in combination to recreate virtual offsets (I think this is what samtools faidx is doing)
KwatMDPhD commented 7 years ago

I checked the .fai files created from pyfaidx and samtools, and they are the same. Also, samtools must have .gzi to work. Hope this information helps.

KwatMDPhD commented 6 years ago

Closing this issue assuming that #1701 closes this issue. Thanks :)

mdshw5 commented 6 years ago

Are you able to test whether the issue is fixed? I’ll look into it as well, but I believe our BGZF indices may still be incompatible with samtools.

mdshw5 commented 6 years ago

Specifically the recursion issue in biopython is fixed, but I’d like to implement .gzi creation and a more efficient sequence retrieval in pyfaidx for BGZF files. Currently pyfaidx must fetch from the beginning of a record to the user specified end coordinate and returns the subset sequence from memory. This isn’t as efficient as samtools, and the limitation is in understanding how samtools generates virtual offsets from the .gzi to get the offset into the start coordinate.

KwatMDPhD commented 6 years ago

I see. When this is in place, please let us know. Thanks @mdshw5

mdshw5 commented 5 years ago

Re-opening to work on this issue before the end of the year.

IPetrik commented 4 years ago

Any progress on this?

mdshw5 commented 4 years ago

@IPetrik I did do some work on this earlier this year, but never made something that works. I believe I pushed what work I had here: https://github.com/mdshw5/pyfaidx/commit/db7f140ce97905d22c2280601f5234dc67711669. I'll take a look on my local machine and see if there's anything else. I'd really like to get this feature working properly so if you've got ideas please share.

mdshw5 commented 4 years ago

@IPetrik Forget me previous comment. I have some work on my local machine that's completely different. I'll update the samtools_bgzf_compatibility branch with what I have.

mdshw5 commented 4 years ago

I've opened a PR with the work for this issue in #164. If I have some time this summer I'll come back and keep working - it doesn't seem like there's much left to do except finish testing the GZI packing/unpacking and implementing methods to create and read the on-disk format.