soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Question: efficient implementation of mmseqs easy-rbh for large datasets #777

Open shimooper opened 10 months ago

shimooper commented 10 months ago

Hi, I want to run "mmseqs easy-rbh" on all pairs of a large dataset of genomes (say, 100 genomes, so 10,000 pairs). I tested it on one pair and it took 27 seconds to run.

I thought to use the modules instead of the easy workflow to make it faster. What I did is:

- `mmseqs createdb`
- `mmseqs createdb`
- `mmseqs rbh alnOffsettedDB tmpDir`
- `mmseqs convertalis alnOffsettedDB output_m8`

The advantage of this is that I run the "createdb" operations only once for all genomes. And indeed, the rbh + convertalis commands took 17 seconds to run, which is faster than 27 seconds and very significant when multiplied over 10,000 genome pairs.
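For concreteness, this is roughly the workflow I have in mind (a sketch only; the genome file and DB names are made up):

```bash
# Sketch only: genome file and DB names are made up.
# One-time step: build a sequence DB per genome.
for fasta in genome1.fasta genome2.fasta; do
    mmseqs createdb "$fasta" "${fasta%.fasta}DB"
done

# Per-pair step: reciprocal best hits, then conversion to tabular output.
mmseqs rbh genome1DB genome2DB alnOffsettedDB tmpDir
mmseqs convertalis genome1DB genome2DB alnOffsettedDB output_m8
```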

But the problem is that the outputs differ: the output of "easy-rbh" had 100 more pairs than the output of my implementation.

Can you please help me fix my implementation, or explain how to use easy-rbh with pre-computed databases?

thanks

milot-mirdita commented 10 months ago

Is this with the latest release? We made a few changes a while ago that only became part of an official release as of release 15, so maybe the issue is already fixed. I don't see anything immediately wrong.

Could you post the full command line output of both calls?

shimooper commented 10 months ago

Hi, thanks for the quick response!

I use the latest mmseqs package (15.6f452). I have attached the following files:

- easy_rbh.log
- easy_rbh.m8.txt
- easy_rbh.py.txt
- my_rbh.log
- my_rbh.m8.txt
- my_rbh.py.txt
- UP000000425_122586_protein.fasta.txt
- UP000000429_85962_protein.fasta.txt

elileka commented 10 months ago

Hi,

There is nothing logically wrong with your script; however, there are a couple of parameters whose default value differs from the value set by the easy-rbh workflow. One of them is the sensitivity/speed tradeoff -s, which easy-rbh sets to 5.7 (highly sensitive) but which defaults to 4 when calling the rbh module directly. When I re-ran your rbh command as `mmseqs rbh UP000000425_122586_protein UP000000429_85962_protein alnOffseted.db tmp -s 5.7`, I could get it to produce 590 results, as in the easy-rbh workflow.

So what should you do now? Examine the results using different sensitivity values (maybe even lower than 4). If lower sensitivities cover your needs (e.g., in cases of close homologs), you can use this parameter to speed up your code even more.
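For example, a quick sweep could look like the following (a sketch with made-up DB names; note the separate tmp directories, since intermediate results may be cached there):

```bash
# Sketch only: DB names are made up. Sweep -s and count recovered RBH pairs.
for s in 2 4 5.7 7.5; do
    mmseqs rbh genome1DB genome2DB "aln_s${s}" "tmp_s${s}" -s "$s"
    mmseqs convertalis genome1DB genome2DB "aln_s${s}" "rbh_s${s}.m8"
    echo "-s $s: $(wc -l < "rbh_s${s}.m8") pairs"
done
```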

Another thing: I see that easy-rbh also differs from two consecutive calls to createdb followed by a call to rbh because of the shuffle parameter. In this specific case, changing this parameter (without changing --max-seqs) does not have an effect (i.e., I could get the number of results to change only through -s), but maybe @milot-mirdita has some good advice on how to set this parameter in the case of a DIY rbh. Milot, in easy-rbh, it seems like the parameter setting is having a hard time, as it gives the following warnings:

    Shuffle database cannot be combined with --createdb-mode 0
    We recompute with --shuffle 0
    Converting sequences
    Multiline fasta can not be combined with --createdb-mode 0
    We recompute with --createdb-mode 1

Maybe it is worth checking.
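In the meantime, one option (an assumption on my side, not something I have verified) is to pass the shuffle setting explicitly, so the DIY createdb mirrors what the easy-rbh log reports falling back to:

```bash
# Assumption: make the DIY createdb match the shuffle setting easy-rbh
# ends up using according to the warnings above (--shuffle 0).
mmseqs createdb genome1.fasta genome1DB --shuffle 0
```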

shimooper commented 9 months ago

Thanks a lot for the response! About the -s parameter: is it possible to calculate it from a desired identity_percent cutoff? Say I want to get all results with identity_percent > 40%; is there a way to calculate the required value of the -s parameter?

(The reason I ask is that I would like to set this parameter dynamically in my pipeline, according to a user-defined identity_percent cutoff.)

Thanks again

elileka commented 9 months ago

Hi,

I see that in the Cluster module of MMseqs2, there is a function that converts sequence identity to a corresponding sensitivity value:

```cpp
float setAutomaticThreshold(float seqId) {
    float sens;
    if (seqId <= 0.3) {
        sens = 6;
    } else if (seqId > 0.8) {
        sens = 1.0;
    } else {
        sens = 1.0 + (1.0 * (0.7 - seqId) * 10);
    }
    return sens;
}
```

So if you call it with seqId = 0.4, it returns a sensitivity of 1.0 + (0.7 - 0.4) * 10 = 4.

But there are some caveats to this:

  1. This function is used for clustering, where the cost of a slightly wrong -s value is slightly worse clustering, but not loss of information (all the input will be clustered, no matter what -s value is selected). This is not the case here, and a too-low -s might cause hits to be lost.
  2. The function is not based on thorough benchmarking; it is more of a "rule of thumb".

Therefore, we recommend you:

  1. Easy: select a slightly higher -s than the one this formula suggests for your seqId cutoff, to be on the safe side (see the sketch after this list for setting it dynamically).
  2. Involved: keep an eye on the results and see if big things get lost.
  3. More involved: benchmark for your purposes. For example, create a dataset of homologous pairs at 40% identity from taxa relevant to your study and see how many can be recovered using various -s values.
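And since you mentioned wanting to set -s dynamically in your pipeline, here is a small shell sketch that transcribes the rule of thumb above (the helper name is mine, and the DB names are made up; the same caveats apply):

```bash
# Hypothetical helper: transcribes setAutomaticThreshold into shell/awk.
# Same caveats as above: a rule of thumb, not a benchmarked guarantee.
sens_for_seqid() {
    awk -v id="$1" 'BEGIN {
        if (id <= 0.3)      s = 6.0;
        else if (id > 0.8)  s = 1.0;
        else                s = 1.0 + (0.7 - id) * 10;
        printf "%.1f\n", s;
    }'
}

# Example: a 0.4 identity cutoff maps to -s 4.0; per recommendation 1,
# consider adding a safety margin on top.
SENS=$(sens_for_seqid 0.4)
mmseqs rbh genome1DB genome2DB alnOffsettedDB tmpDir -s "$SENS"
```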