mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx

question for Iterator mode #162

Closed yhgon closed 4 years ago

yhgon commented 4 years ago

I wrote a simple test script.

How can I avoid loading the whole dataset into memory?

  1. On the first run, it builds the index as expected. My test dataset is 16 GB; it generates a 1.6 GB .fai file, and memory usage during generation is not a problem.
  2. However, once the .fai file is finished, it starts loading the whole dataset into memory.

How can I configure it to avoid loading the whole dataset into memory? (My dataset is larger than the host system's memory.)

from pyfaidx import Fasta

def print_fasta(filename, max_id=9999999999, print_iter=10000):
    # Opening the file builds the .fai index on the first run and reads it on later runs.
    proteins = Fasta(filename)

    for i, key in enumerate(proteins.keys()):
        if i > max_id:
            return
        record = proteins[key]  # look the record up once instead of three times

        if i % print_iter == 0:
            # The original format string had five fields for six arguments,
            # so the sequence prefix was silently dropped; a sixth field prints it.
            print("DEBUG : iter {:16d}  {:d} {}  {} {} {}".format(
                i, len(record), record.name, record.long_name, key, record[0:10]))

mdshw5 commented 4 years ago

It looks like your issue is the memory usage for loading the FAI into system memory. Unfortunately in the current pyfaidx code it's impossible to iterate over the sequence names (keys) unless these are stored in memory, and if you have a file consisting of many small sequences with relatively large sequence names this limitation becomes quite noticeable. Note that this limitation is also present in samtools faidx/fqidx and there is a warning in its documentation.
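
To get a feel for how much of the index is made up of sequence names, the .fai file itself can be streamed line by line without holding it in memory. This is only a minimal sketch: it assumes the usual tab-separated .fai layout with the sequence name in the first column, and `fai_name_stats` is a hypothetical helper, not part of pyfaidx.

```python
def fai_name_stats(fai_path):
    """Stream a .fai file and return (record_count, total_name_bytes)."""
    count = 0
    name_bytes = 0
    with open(fai_path, "rb") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            # The sequence name is everything before the first tab.
            name = line.split(b"\t", 1)[0]
            count += 1
            name_bytes += len(name)
    return count, name_bytes
```

The total name bytes are a lower bound on what a name-keyed in-memory index needs, since per-object overhead in Python comes on top of it.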

However, it looks like you are iterating over every record in your file, so you may not need to access FASTA sequences by name. If that is the case, you may use a method such as this to iterate directly over your FASTA file, record by record.
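
The method referred to above is not reproduced in this thread, so the following is only a sketch of the general idea, not necessarily the approach that was linked. It reads a plain-text FASTA file sequentially and keeps a single record in memory at a time; `iter_fasta` is a hypothetical name, and the file is assumed to be uncompressed.

```python
def iter_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, one record at a time."""
    header = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header = line[1:]
                chunks = []
            elif header is not None:
                chunks.append(line.strip())
    # Emit the last record, which has no following header to close it.
    if header is not None:
        yield header, "".join(chunks)
```

Used as `for i, (header, seq) in enumerate(iter_fasta(filename)):`, this would replace the loop over `proteins.keys()` in the script above, with no index and no name lookup involved.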

yhgon commented 4 years ago

Thanks for recommending the method. I wrote a sequential reader for multi-line FASTA files that builds an index similar to faidx and fetches the queried sequences. My method is slower than faidx, but it uses only a few MB for indexing and extracts just the sequence data I need.
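
The reply does not include code, so what follows is only one possible reading of that approach, not the poster's actual implementation. It scans the file once, records byte offsets only for the names being queried (which keeps the index small), and then seeks to each offset to read the sequence. `index_selected` and `fetch` are hypothetical names, and the sequence name is assumed to be the first whitespace-delimited token of the header line.

```python
def index_selected(path, wanted):
    """Scan once; return {name: byte offset of its header line} for wanted names."""
    wanted = set(wanted)
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            pos = handle.tell()  # offset of the line about to be read
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                name = fields[0].decode() if fields else ""
                if name in wanted:
                    offsets[name] = pos
    return offsets


def fetch(path, offset):
    """Return the sequence of the record whose header line starts at offset."""
    chunks = []
    with open(path, "rb") as handle:
        handle.seek(offset)
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break  # next record begins
            chunks.append(line.strip())
    return b"".join(chunks).decode()
```

The trade-off matches the one described: every new set of queries costs a full pass over the file, but memory use scales with the number of requested names rather than with the number of records.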