soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

how is index loaded into memory when performing multiple queries? #527

Open marcmk6 opened 2 years ago

marcmk6 commented 2 years ago

Hi,

I have a quick question. Say I create an index for the uniref30_2103_db database with 3 splits: mmseqs createindex uniref30_2103_db tmp --split 3, and then run 50 queries (in a single .fasta file) against it using the colabfold_search.sh script provided on https://colabfold.mmseqs.com. Will each of the three partial indices be loaded into memory ~50 times? Assume my RAM cannot hold more than one partial index at a time and that I don't use colabfold_envdb.

In other words, I'm wondering whether mmseqs works like 1)

for query in queries_in_fasta:
    for partial_index_file in indices:
        search(query, partial_index_file)

or 2)

for partial_index_file in indices:
    for query in queries_in_fasta:
        search(query, partial_index_file)

In the first case, I guess each partial index would be loaded from storage into RAM num_of_queries times, which is slow; in the second case, each partial index would be loaded just once.

Thanks

martin-steinegger commented 2 years ago

We do the second version. We load an index and then process all queries against the split.
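
To make the difference between the two loop orders concrete, here is a minimal Python sketch that only counts simulated index loads. It is not MMseqs2 code: count_loads and its parameters are hypothetical names, and it assumes (as in the question) that RAM holds exactly one split at a time, so switching to a different split always costs one load from storage.

```python
def count_loads(query_major: bool, n_queries: int, n_splits: int) -> int:
    """Count simulated index loads when only one split fits in RAM.

    Hypothetical helper for illustration; it does not call MMseqs2.
    """
    loads = 0
    resident = None  # the split currently held in (simulated) RAM

    def search(query: int, split: int) -> None:
        nonlocal loads, resident
        if resident != split:  # split not in RAM, so it must be loaded
            loads += 1
            resident = split
        # the actual search of `query` against `split` would happen here

    if query_major:
        # Version 1: for each query, walk over every split.
        for q in range(n_queries):
            for s in range(n_splits):
                search(q, s)
    else:
        # Version 2: for each split, process all queries.
        for s in range(n_splits):
            for q in range(n_queries):
                search(q, s)
    return loads


print(count_loads(query_major=True, n_queries=50, n_splits=3))   # 150
print(count_loads(query_major=False, n_queries=50, n_splits=3))  # 3
```

Under these assumptions, version 1 would reload a split 150 times for 50 queries and 3 splits, while version 2 loads each split once, for 3 loads in total.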