Closed KwatMDPhD closed 3 years ago
Follow up: samtools faidx fails using the index created from pyfaidx as well.
I think this is a good idea, and the work to support this is:
Faidx logic to use the .fai and .gzi indices in combination to recreate virtual offsets (I think this is what samtools faidx is doing)
I checked the .fai files created by pyfaidx and samtools, and they are the same. Also, samtools must have the .gzi to work. Hope this information helps.
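For reference, the .fai half of that combination can be sketched as follows. This is a minimal illustration, not pyfaidx's actual code: it assumes the standard five-column .fai layout (name, length, offset, line bases, line width), and the names `FaiRecord`, `parse_fai_line`, and `uncompressed_offset` are made up for this example. It turns a 0-based base position into a byte offset in the uncompressed FASTA, which is the number a .gzi lookup would then need.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    """One line of a .fai index (hypothetical helper type)."""
    name: str
    length: int     # number of bases in the sequence
    offset: int     # byte offset of the first base in the uncompressed file
    linebases: int  # bases per line
    linewidth: int  # bytes per line, including the line terminator


def parse_fai_line(line: str) -> FaiRecord:
    """Parse the first five tab-separated columns of a .fai line."""
    name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
    return FaiRecord(name, int(length), int(offset), int(linebases), int(linewidth))


def uncompressed_offset(rec: FaiRecord, pos: int) -> int:
    """Byte offset in the uncompressed FASTA of 0-based base `pos`."""
    if not 0 <= pos < rec.length:
        raise ValueError(f"position {pos} outside sequence {rec.name}")
    # Each full line before `pos` costs `linewidth` bytes (bases + newline).
    full_lines, within_line = divmod(pos, rec.linebases)
    return rec.offset + full_lines * rec.linewidth + within_line
```

With a record such as `chr1	100	6	60	61`, base 60 (the first base of the second line) maps to byte 6 + 61 = 67, because the newline at the end of the first line is skipped.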
Closing this issue on the assumption that #1701 resolves it. Thanks :)
Are you able to test whether the issue is fixed? I’ll look into it as well, but I believe our BGZF indices may still be incompatible with samtools.
Specifically, the recursion issue in biopython is fixed, but I’d like to implement .gzi creation and more efficient sequence retrieval in pyfaidx for BGZF files. Currently pyfaidx must fetch from the beginning of a record to the user-specified end coordinate and return the subset sequence from memory. This isn’t as efficient as samtools, and the limitation is in understanding how samtools generates virtual offsets from the .gzi to get the offset of the start coordinate.
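A rough sketch of how such a lookup could work is below. It assumes the usual BGZF convention that a virtual offset is the compressed block start shifted left by 16 bits, OR-ed with the position inside the decompressed block, and that the .gzi supplies (compressed offset, uncompressed offset) pairs for block starts. The function name `virtual_offset` and the block table are illustrative only; this is my reading of the approach, not a confirmed description of what samtools does internally.

```python
from bisect import bisect_right


def virtual_offset(blocks, upos):
    """Map an uncompressed byte position to a BGZF virtual offset.

    `blocks` is a list of (compressed_offset, uncompressed_offset) pairs,
    sorted by uncompressed offset and starting with (0, 0).
    """
    if upos < 0:
        raise ValueError("uncompressed position must be non-negative")
    starts = [u for _, u in blocks]
    # Last block whose uncompressed start is <= upos.
    i = bisect_right(starts, upos) - 1
    coffset, ustart = blocks[i]
    within = upos - ustart
    if within >= 1 << 16:
        # A BGZF block holds at most 64 KiB of uncompressed data.
        raise ValueError("position lies beyond the last indexed block")
    return (coffset << 16) | within
```

Seeking to the returned virtual offset and reading only the requested span would avoid decompressing from the beginning of the record.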
I see. When this is in place, please let us know. Thanks @mdshw5
Re-opening to work on this issue before the end of the year.
Any progress on this?
@IPetrik I did do some work on this earlier this year, but never made something that works. I believe I pushed what work I had here: https://github.com/mdshw5/pyfaidx/commit/db7f140ce97905d22c2280601f5234dc67711669. I'll take a look on my local machine and see if there's anything else. I'd really like to get this feature working properly so if you've got ideas please share.
@IPetrik Forget my previous comment. I have some work on my local machine that's completely different. I'll update the samtools_bgzf_compatibility branch with what I have.
I've opened a PR with the work for this issue in #164. If I have some time this summer I'll come back and keep working - it doesn't seem like there's much left to do except finish testing the GZI packing/unpacking and implementing methods to create and read the on-disk format.
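To make the remaining packing/unpacking work concrete, here is a small sketch of the .gzi on-disk layout as I understand it from the htslib bgzip documentation: a little-endian uint64 entry count, followed by that many pairs of little-endian uint64 values (compressed offset, uncompressed offset). I am assuming the first block at (0, 0) is implicit and not stored. `pack_gzi` and `unpack_gzi` are hypothetical names, not the functions in the PR.

```python
import struct


def pack_gzi(blocks):
    """Serialize (compressed_offset, uncompressed_offset) pairs to .gzi bytes.

    `blocks` should exclude the implicit first block at (0, 0).
    """
    parts = [struct.pack("<Q", len(blocks))]
    for coffset, uoffset in blocks:
        parts.append(struct.pack("<QQ", coffset, uoffset))
    return b"".join(parts)


def unpack_gzi(data):
    """Parse .gzi bytes back into a list of (compressed, uncompressed) pairs."""
    if len(data) < 8:
        raise ValueError("truncated .gzi data: missing entry count")
    (count,) = struct.unpack_from("<Q", data, 0)
    if len(data) < 8 + 16 * count:
        raise ValueError("truncated .gzi data: missing entries")
    return [struct.unpack_from("<QQ", data, 8 + 16 * i) for i in range(count)]
```

A round trip through both functions should return the original pairs, which makes for a simple first test before comparing against a .gzi written by samtools or bgzip.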
Hi,
When I do samtools faidx file.fa.gz and then try to use the same file.fa.gz file for pyfaidx, I get an error saying that file.fa.gz is not a valid BGZF file. But when I delete the file.fa.gz.fai and then use pyfaidx, the error disappears. I believe this is because the .fai pyfaidx creates is different from the .fai samtools creates.
If this behavior is real, is it possible to unify the .fai of pyfaidx and samtools? Thoughts?
Kind regards, Kwat