mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx
Other
459 stars 75 forks source link

BGZip slow performance near end of chromosomes #153

Open davmlaw opened 5 years ago

davmlaw commented 5 years ago

It can take over a minute to retrieve a few bases:

$ time faidx --no-rebuild Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz 7:117199563-117199564
>7:117199563-117199564
GG

real    1m8.888s
user    0m31.042s
sys 0m37.785s

Low coordinates are fine:

$time faidx --no-rebuild Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz 7:10000-11000 > /dev/null

real    0m0.295s
user    0m0.241s
sys 0m0.056s

You said in a previous issue:

there is still a large performance penalty for fetching small substrings near the end of a record, and I'll open an issue to remind me to explore a solution.

I can't find that issue, so am raising this one. Good luck!

Maarten-vd-Sande commented 2 years ago

Seems that the lookup time just scales with the distance from the "start" of a contig. I just quickly scanned the internals, can't say I fully understand, but it seems that this is due to the way bgzip is implemented in biopython:

https://github.com/biopython/biopython/blob/master/Bio/bgzf.py#L699

It seems to read the whole part before the contig you need...?

mdshw5 commented 2 years ago

@Maarten-vd-Sande this is definitely not due to the Bio.bgzf implementation and is definitely due to my incomplete implementation of virtual offset calculations from the start of each contig. I started work to fully support using the .gzi sidecar files in #164, but have not taken the time to complete the work. From the .fai index we can know how many bases (characters) to skip, and can seek directly to the requested region in an uncompressed FASTA. For BGZF compressed FASTA in current pyfaidx implementation can know which BGZF block to seek to the beginning of a contig, but without implementing logic to incorporate the .gzi (which tells us the internal BGZF block structure of a contig) the safest thing we can do is start reading from the beginning of the sequence. This is not ideal. The alternative is to seek to the BGZF block nearest to the coordinate of interest, and then start reading from the beginning of that block. This is the relevant code that I wrote 1.5 years ago:

https://github.com/mdshw5/pyfaidx/blob/f87877501488e8c47917d0083ccb4df66c003704/pyfaidx/__init__.py#L766-L776

You can see that I was still trying to figure out how this works, and never was able to make an entire round-trip (read a .gzi file and construct and write an identical .gzi file) without some slight errors.

Maarten-vd-Sande commented 2 years ago

@mdshw5 thanks for the reply, that makes sense! I guess I'll just load the whole fasta in memory for now :smile: