soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0
1.36k stars 190 forks source link

All-vs-All alignments with fake prefilter give unexpected sequence identities #498

Open fteufel opened 2 years ago

fteufel commented 2 years ago

Hi,

I'm using MMseqs2 for all-vs-all alignments. As indicated in the user guide pdf, I'm using the bash script to perform a fake prefiltering step before aligning. That seems to work, as all alignments are computed, but the results are a bit unexpected:

>Q8CD91
EERVVHWYFKLLDKNSSGDIGKKEIKPFKRFLRKKSKPKKCVKKFVEYCDMNNDKSITVQELMGCLGVTREEGKANTRKRHTPRGNAESSSSNRQPRKQG
>Q95JC9
GPPPPGPAPPGARPPPGPPPPGPPPPGPAPPGARPPPGPPPPGPPPPGPAPPGARPPPGPPPPGPPPPGPAPPGARPPPGPPPPPPGPSPPRPPPGPPPQ

I didn't change much of the defaults

mmseqs createdb --dbtype 1 dummyfasta.fasta seq_db
mmseqs_fake_prefilter.sh seq_db seq_db pref
mmseqs align seq_db seq_db pref align_db --alignment-mode 3 -e inf --min-seq-id 0.3
mmseqs convertalis seq_db seq_db align_db check_alignments.tab
query         target        fident  alnlen  mismatch gapopen qstart  qend    tstart  tend    evalue          bits
Q8CD91        Q8CD91        1.000   100     0       0        1       100     1       100     1.707E-71       202
Q8CD91        Q95JC9        1.000   2       0       0        83      84      91      92      4.610E+01       9
Q95JC9        Q95JC9        1.000   100     0       0        1       100     1       100     7.088E-35       108
Q95JC9        Q8CD91        0.333   3       1       0        11      13      21      23      1.508E+02       7

The second alignment's fident makes no sense to me. I understand that the e-value is high too, but I really want all pairwise % identities returned so I don't use it for filtering. Do I need to use a different configuration to make this work, or is mmseqs2 just not suitable for this all-vs-all identity task?

milot-mirdita commented 2 years ago

That seems about right? It aligned two residues successfully (from 83 to 84). You might want to demand some coverage thresholds (-c/-cov-mode) or a minimum aln length threshold (--min-aln-len).

fteufel commented 2 years ago

Thanks for pointing out this was correct! I did not realize that this was just a denominator issue. --seq-id-mode was what I needed to change, as the default does not make much sense for this application.