Large faidx. Huge memory usage?

aswarren commented 6 years ago

Hi, I recently built a bgzf and fai file for all of uniprotkb using htslib-1.7 and Samtools 1.5 respectively 24G Feb 2 14:21 uniprot_all.fasta.bgz 5.1G Feb 2 17:15 uniprot_all.fasta.bgz.fai 12M Feb 2 17:13 uniprot_all.fasta.bgz.gzi

When reading in the file with Fasta(self.fasta_file, as_raw=True, duplicate_action="first")

I am getting large memory usage: VIRT RES SHR S %CPU %MEM TIME+ COMMAND 55.972g 0.052t 15996 R 100.000 0.618 29:27.53 python

This might be slightly ridiculous. But just wanted to post in case helpful.

mdshw5 commented 6 years ago

:) This is a really big file. For large files you might try the Faidx class instead. I just realized I don't point this out except at the bottom of the documentation above "support for compressed FASTA". I'll show an example of the memory overhead of the Fasta object compared to Faidx.

Here's the object graph for loading tests/data/genes.fasta: faidx

That's 1048 bytes for the index stored in a dictionary of NamedTuples, plus 64 bytes for the Faidx object for a total of 1112 bytes.

Here's the object graph for the same data using Fasta: fasta

Now you have a FastaRecord instance for every entry in the index (this is what allows the dictionary and slicing behavior of Fasta for a total of 9280 (FastaRecord dictionaries) + 964 (FastaRecord instances) + 64 (Fasta instance) = 3160 bytes.

If you discount the overhead of the Fasta or Faidx instance and only look at the way the underlying data structures scale, you have approximately (because of the way python's dictionaries are implemented they grow stepwise) 3X more memory usage for Fasta instances compared to Faidx.

Oh, and its even worse than this because Fasta instances must reference a Faidx instance, since the underlying method for slicing a FastaRecord is to call Faidx.fetch() to do the actual work.

Most of the time the memory usage for storing the .fai index using Fasta instances won't be an issue, but you've definitely shown an edge case and may want to consider using Faidx instead.

mdshw5 commented 6 years ago

I added an optimization to FastaRecord that brings the Fasta example down to 1968 bytes, which is a ~1/3 reduction. Thanks for opening this issue and letting me know about your use case!

aswarren commented 6 years ago

Thanks for the great explanation. I will adapt to using the Faidx structure and see how it works out. I suppose this is a non-issue then but I'll let you close it so the info can stay near the surface that much longer.

mdshw5 commented 6 years ago

I'd say it's definitely something I should mention in the documentation now that people can use huge BGZF compressed files. It's also worth noting that my BGZF retrieval code is not as efficient as it probably could be - see #125

mdshw5 / pyfaidx

Large faidx. Huge memory usage? #137