Open davmlaw opened 5 years ago
Seems that the lookup time just scales with the distance from the "start" of a contig. I just quickly scanned the internals, can't say I fully understand, but it seems that this is due to the way bgzip is implemented in biopython:
https://github.com/biopython/biopython/blob/master/Bio/bgzf.py#L699
It seems to read the whole part before the contig you need...?
@Maarten-vd-Sande this is definitely not due to the Bio.bgzf implementation; it is due to my incomplete implementation of virtual offset calculations from the start of each contig. I started work to fully support the .gzi sidecar files in #164, but have not taken the time to complete it. From the .fai index we know how many bases (characters) to skip, so we can seek directly to the requested region in an uncompressed FASTA. For a BGZF-compressed FASTA, the current pyfaidx implementation knows which BGZF block to seek to for the beginning of a contig, but without logic to incorporate the .gzi (which tells us the internal BGZF block structure of a contig), the safest thing we can do is start reading from the beginning of the sequence. This is not ideal. The alternative is to seek to the BGZF block nearest to the coordinate of interest and then start reading from the beginning of that block. This is the relevant code that I wrote 1.5 years ago:
You can see that I was still trying to figure out how this works, and never was able to make an entire round-trip (read a .gzi file and construct and write an identical .gzi file) without some slight errors.
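For anyone following along, here is a rough sketch of the "seek to the nearest BGZF block" idea described above. It is not the pyfaidx code and not the work in #164. The function names are hypothetical. It assumes the .gzi layout written by bgzip: a little-endian uint64 entry count, followed by (compressed offset, uncompressed offset) uint64 pairs, with the first block at (0, 0) left implicit. It also assumes the usual BGZF virtual offset encoding, with the compressed block start in the high bits and the position inside the block in the low 16 bits.

```python
import bisect
import struct


def fai_offset(contig_offset, linebases, linewidth, pos):
    """Uncompressed byte offset of 0-based base `pos`, from the .fai columns."""
    return contig_offset + (pos // linebases) * linewidth + pos % linebases


def parse_gzi(data):
    """Parse raw .gzi bytes into (uncompressed_start, compressed_start) pairs.

    Assumed layout: uint64 count, then count pairs of
    (compressed_offset, uncompressed_offset), all little-endian.
    """
    (count,) = struct.unpack_from("<Q", data, 0)
    blocks = [(0, 0)]  # the first block is implicit in the assumed layout
    for i in range(count):
        compressed, uncompressed = struct.unpack_from("<QQ", data, 8 + 16 * i)
        blocks.append((uncompressed, compressed))
    return blocks


def virtual_offset(blocks, uncompressed_offset):
    """Find the block containing the offset and build a BGZF virtual offset."""
    starts = [uncompressed for uncompressed, _ in blocks]
    i = bisect.bisect_right(starts, uncompressed_offset) - 1
    block_uncompressed, block_compressed = blocks[i]
    within_block = uncompressed_offset - block_uncompressed
    return (block_compressed << 16) | within_block


# Toy index with two explicit blocks (made-up numbers, not from a real file).
toy_gzi = struct.pack("<QQQQQ", 2, 1000, 65280, 2100, 130560)
toy_blocks = parse_gzi(toy_gzi)
```

If this is right, the resulting value could be handed to something like `Bio.bgzf.BgzfReader.seek()`, so only the one block containing the requested coordinate needs decompressing rather than everything from the start of the contig.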
@mdshw5 thanks for the reply, that makes sense! I guess I'll just load the whole fasta in memory for now :smile:
It can take over a minute to retrieve a few bases:
Low coordinates are fine:
You said in a previous issue:
I can't find that issue, so am raising this one. Good luck!