aswarren closed this issue 6 years ago.
:) This is a really big file. For large files you might try the Faidx class instead. I just realized I don't point this out except at the bottom of the documentation, above "support for compressed FASTA". I'll show an example of the memory overhead of the Fasta object compared to Faidx.
Here's the object graph for loading tests/data/genes.fasta:
That's 1048 bytes for the index stored in a dictionary of namedtuples, plus 64 bytes for the Faidx object, for a total of 1112 bytes.
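For anyone who wants to approximate this kind of measurement themselves, here's a minimal sketch using sys.getsizeof (shallow sizes only, so the container and its entries are summed; exact byte counts vary by Python version and platform):

```python
import sys
from pyfaidx import Faidx

fai = Faidx('tests/data/genes.fasta')
# Faidx.index is a dict mapping record names to namedtuples of index fields.
index_bytes = sys.getsizeof(fai.index) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in fai.index.items())
print(index_bytes, sys.getsizeof(fai))
```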
Here's the object graph for the same data using Fasta:
Now you have a FastaRecord instance for every entry in the index (this is what allows the dictionary and slicing behavior of Fasta), for a total of 2132 bytes (FastaRecord dictionaries) + 964 bytes (FastaRecord instances) + 64 bytes (Fasta instance) = 3160 bytes.
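A quick way to see those per-entry FastaRecord objects (a sketch, assuming the records attribute of Fasta, which holds the name-to-FastaRecord mapping):

```python
from pyfaidx import Fasta

genes = Fasta('tests/data/genes.fasta')
# Fasta builds one FastaRecord per index entry up front:
print(len(genes.records))      # one entry per sequence in the file
record = next(iter(genes))     # iterating a Fasta yields FastaRecord objects
print(type(record))            # <class 'pyfaidx.FastaRecord'>
```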
If you discount the overhead of the Fasta or Faidx instance and only look at how the underlying data structures scale, you get approximately 3X more memory usage for Fasta instances compared to Faidx (approximately, because Python's dictionaries grow stepwise).
Oh, and it's even worse than this, because Fasta instances must also reference a Faidx instance: the underlying method for slicing a FastaRecord is to call Faidx.fetch() to do the actual work.
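To make the delegation concrete, here's a sketch (the record name is hypothetical; FastaRecord slicing is 0-based and half-open, while Faidx.fetch() takes 1-based inclusive coordinates):

```python
from pyfaidx import Fasta

genes = Fasta('tests/data/genes.fasta')
record = genes['KF435150.1']   # hypothetical record name
# Slicing the record delegates to Faidx.fetch() on the shared Faidx
# instance, so these two calls read the same bases:
assert str(record[0:10]) == str(genes.faidx.fetch('KF435150.1', 1, 10))
```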
Most of the time the memory usage for storing the .fai index using Fasta instances won't be an issue, but you've definitely shown an edge case and may want to consider using Faidx instead.
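For comparison, the Faidx route looks like this (a sketch; the record name is hypothetical):

```python
from pyfaidx import Faidx

fa = Faidx('tests/data/genes.fasta')
# No per-record objects are built up front; fetch() reads straight from
# the indexed file using 1-based, inclusive coordinates.
seq = fa.fetch('KF435150.1', 1, 100)   # hypothetical record name
print(seq)
```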
I added an optimization to FastaRecord that brings the Fasta example down to 1968 bytes, which is a ~1/3 reduction. Thanks for opening this issue and letting me know about your use case!
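For context, a common way to shrink per-instance overhead like this in Python is __slots__, which removes the per-object __dict__. A generic illustration of the idea, not necessarily the exact change made here:

```python
import sys

class PlainRecord:
    def __init__(self, name):
        self.name = name

class SlottedRecord:
    __slots__ = ('name',)
    def __init__(self, name):
        self.name = name

plain = PlainRecord('chr1')
slotted = SlottedRecord('chr1')
# The per-instance __dict__ dominates the plain version's footprint:
print(sys.getsizeof(plain), sys.getsizeof(plain.__dict__))
print(sys.getsizeof(slotted))  # slotted instances carry no __dict__ at all
```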
Thanks for the great explanation. I will adapt to using the Faidx structure and see how it works out. I suppose this is a non-issue then but I'll let you close it so the info can stay near the surface that much longer.
I'd say it's definitely something I should mention in the documentation now that people can use huge BGZF-compressed files. It's also worth noting that my BGZF retrieval code is not as efficient as it probably could be - see #125.
Hi, I recently built a bgzf and fai file for all of UniProtKB using htslib-1.7 and Samtools 1.5 respectively:

```
24G   Feb 2 14:21  uniprot_all.fasta.bgz
5.1G  Feb 2 17:15  uniprot_all.fasta.bgz.fai
12M   Feb 2 17:13  uniprot_all.fasta.bgz.gzi
```
When reading in the file with Fasta(self.fasta_file, as_raw=True, duplicate_action="first") I am getting large memory usage:

```
VIRT     RES     SHR    S  %CPU     %MEM   TIME+     COMMAND
55.972g  0.052t  15996  R  100.000  0.618  29:27.53  python
```
This might be slightly ridiculous, but I just wanted to post in case it's helpful.
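For a file this size, the Faidx approach suggested above would look something like the sketch below (the accession is hypothetical; the keyword arguments mirror the Fasta call above):

```python
from pyfaidx import Faidx

# Same file and options, but via Faidx: the index is held as a dict of
# namedtuples without building a record object per entry.
fa = Faidx('uniprot_all.fasta.bgz', as_raw=True, duplicate_action="first")
seq = fa.fetch('sp|P12345|EXAMPLE_HUMAN', 1, 50)  # hypothetical accession
```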