Closed yhgon closed 4 years ago
It looks like your issue is the memory needed to load the FAI index into system memory. Unfortunately, the current pyfaidx code cannot iterate over the sequence names (keys) unless they are all stored in memory, so for a file consisting of many small sequences with relatively long sequence names this limitation becomes quite noticeable. Note that the same limitation is present in samtools faidx/fqidx, and there is a warning about it in its documentation.
However, it looks like you are iterating over every record in your file, so you may not need to access FASTA sequences by name. If that is the case, you can use a method such as this to iterate over your FASTA file directly, record by record.
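For illustration, here is a minimal sketch of what index-free, record-by-record iteration can look like in plain Python. It is not necessarily the method referred to above and it does not use any pyfaidx API; the function name `iter_fasta` is hypothetical. It assumes a plain (uncompressed) FASTA input and keeps only one record in memory at a time.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    `lines` can be an open text file or any iterable of strings.
    Only the record currently being read is held in memory.
    Hypothetical helper for illustration, not part of pyfaidx.
    """
    name: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:]
            chunks = []
        elif name is not None:
            # Sequence line belonging to the current record.
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if name is not None:
        yield name, "".join(chunks)
```

With a file on disk, this could be used as `with open(path) as fh: for name, seq in iter_fasta(fh): ...`, where `path` is whatever FASTA file you are processing. Because the file is read as a stream, memory use depends on the longest single record and not on the number of records or the total file size.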
I just made a simple test script.
How can I avoid loading the whole dataset into memory?
How can I configure pyfaidx to avoid loading the whole dataset into memory? (My dataset is larger than the host system's memory.)