soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Does input FASTA file have to be aligned? #841

Open laurien-0 opened 2 months ago

laurien-0 commented 2 months ago

I have run just the following commands

mmseqs createdb x_protseqs.fasta x_db
mmseqs cluster x_db x_clust tmp --min-seq-id 0.9
mmseqs createtsv x_db x_db x_clust x_clust.tsv

My input x_protseqs.fasta is not aligned, and I got some slightly weird results from it. Namely, when I aligned all the cluster representatives with an online MSA tool and plotted the PIM (percentage identity matrix), I got some 99% values in there.

Is this just a quirk of the different alignment algorithms or should I be pre-aligning my data?

Thank you

milot-mirdita commented 2 months ago

The clustering does NOT take aligned input. Gaps would be turned into X characters and would result in very odd alignments.
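For anyone who does start from an aligned FASTA, one option is to strip the gap characters before running createdb. This is only a minimal sketch, assuming gaps are written as '-' or '.' and that header lines start with '>'. The function name strip_gaps and the file name x_aligned.fasta are hypothetical and not part of MMseqs2.

```shell
# Hypothetical helper (not part of MMseqs2): remove alignment gap
# characters ('-' and '.') from sequence lines of a FASTA stream.
# Header lines (starting with '>') are passed through unchanged.
strip_gaps() {
    awk '/^>/ { print; next } { gsub(/[-.]/, ""); print }'
}

# Example usage (file names are placeholders):
# strip_gaps < x_aligned.fasta > x_protseqs.fasta
# mmseqs createdb x_protseqs.fasta x_db
```

Only sequence lines are touched, so identifiers that contain a hyphen or a dot are left as they are.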

I am not sure I understand your issue with the weird alignments.

laurien-0 commented 2 months ago

Thank you, that is useful to know. To clarify: I clustered at 70% and at 90%, and in both cases I downloaded the representative sequences from each cluster and ran them through an MSA tool. You would expect the pairwise identities between representatives to be at most roughly 70% and 90% respectively, right? The PIM is the percentage identity matrix. Instead I got values of up to 99% in both.