soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License
1.47k stars 200 forks source link

Unexpectedly short alignment lengths using alignall #585

Closed knishiura3 closed 2 years ago

knishiura3 commented 2 years ago

Expected Behavior

Workflow: cluster -> alignall -> createtsv

I noticed there are cluster members in my output with very short alignment lengths. I was expecting the aligned length to be at least 50% of the cluster representative length, but that appears not to be the case.

Current Behavior

With the evalue cutoff for alignall set to a very large number, I'm getting really short alignment lengths that make me confused about how these cluster members were assigned to the cluster representative in the first place. For example on the first line below, the aligned length of query and target for the cluster representative 1288495_4011 and cluster member 758793_4580 appear to be 4 amino acids, which is much shorter than 0.5*65 = 32.5

1288495_4011 1288495_4011 758793_4580 12 0.600 3.362E+08 78 82 120 41 45 65
1288495_4011 1288495_4011 458836_4783 12 0.227 3.362E+08 19 40 120 0 21 49 
1288495_4011 1288495_4011 1777137_4763 12 0.666 3.362E+08 25 30 120 57 62 66
1288495_4011 1288495_4011 1968433_3027 13 0.363 1.888E+08 52 71 120 71 92 372
1288495_4011 1288495_4011 268404_3908 14 0.444 1.059E+08 23 31 120 166 174 379

(showing only 5 lines for brevity)

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.

mmseqs createdb protfam.fasta protfamDB
mmseqs cluster --threads 24 --min-seq-id 0.4 -c 0.5 --cov-mode 2 --cluster-mode 0 protfamDB clusterDB tmp
mmseqs createsubdb <(printf '8277758') clusterDB 1288495_4011.subdb
mmseqs alignall protfamDB 1288495_4011.subdb 1288495_4011.alnDB --comp-bias-corr 0 --cov-mode 2 --alignment-mode 3 --threads 24 -e 9007199254740992
mmseqs createtsv protfamDB protfamDB 1288495_4011.alnDB 1288495_4011.alnDB.tsv
OFS="\t" awk 'FNR==NR{a[$1]=$2;next}{$3=a[$3]; print}' protfamDB.lookup 1288495_4011.alnDB.tsv > 1288495_4011_renamed.tsv
awk -F' ' '!seen[$2,$3]++ && !seen[$3,$2]++' 1288495_4011_renamed.tsv > 1288495_4011_dedupe.tsv
sort -grk6,6 1288495_4011_dedupe.tsv | awk '$2=="1288495_4011" {print $0}'

Your Environment

Include as many relevant details about the environment you experienced the bug in.

milot-mirdita commented 2 years ago

Can you try clustering with the --cluster-reassign Parameter?

due to the cascaded cluster algorithm, clustern can end up merged that have members below the clustering thresholds. This parameter fixes the clustering afterwards.

the paarwise alignments among cluster members that alignall produces can be much more dissimilar though.

knishiura3 commented 2 years ago

Thanks for the quick response.

I've tried clustering again with --cluster-reassign 1, but am still getting alignment lengths that appear to violate the clustering criteria, unless I'm misunderstanding something:

mmseqs cluster protfamDB clusterDB_reassign ~/scratch/tmp -s 6 --threads 24 --min-seq-id 0.4 -c 0.5 --cov-mode 2 --cluster-mode 0 --cluster-reassign 1
mmseqs align protfamDB protfamDB clusterDB_reassign alnDB_reassign -a --threads 24 --comp-bias-corr 0 -e 9007199254740992
mmseqs convertalis protfamDB protfamDB alnDB_reassign aln_to_clustreps.m8
awk '$4<50{print}' aln_to_clustreps.m8 | sort -nk4,4 | head
45656_3631      1137799_3214    0.950   20      1       0       79      98      423     442     2.607E-05       51
45656_3631      217204_6534     0.809   21      4       0       79      99      695     715     1.266E-04       49
45656_3631      2571105_3757    0.761   21      5       0       79      99      695     715     3.266E-04       48
45656_3631      668_9026        0.761   21      5       0       79      99      714     734     1.155E-03       46
489_1270        489_2691        1.000   21      0       0       1       21      1       21      3.857E-05       50
33968_38        33968_2958      0.833   24      4       0       92      115     128     151     2.956E-04       48
1550024_4392    1608583_1648    0.720   25      7       0       6       30      7       31      2.660E-04       48
2527978_7844    2527978_861     0.920   25      2       0       1       25      1       25      1.690E-05       52
45656_3631      191863_3100     0.880   25      3       0       75      99      282     306     8.777E-08       58
1128665_3036    709876_1980     0.653   26      9       0       43      68      211     236     2.419E-03       45

For example, in the first row of output above, the alignment length is 20 residues for a query sequence that is 100 residues long. My clustering parameters (-c 0.5 --cov-mode 2) should prevent a target sequence w/ alignment length <50 residues from joining this cluster, right?

On a related note, I noted this line in my clustering output log:

Combining cluster mode 0 in combination with coverage mode 2 can produce wrong results.
Please use --cov-mode 2

And found this comment from a previous issue: The cluster mode 0 (set cover) requires a bi-directional coverage criteria. If it is used with unidirectional coverage than it pick representative sequences that do not fulfill the clustering criteria anymore. Originally posted by @martin-steinegger in https://github.com/soedinglab/MMseqs2/issues/218#issuecomment-580978693

It seems like my choice of parameters is at fault, and that I would probably have clusters that more closely obey the criteria if I used cov-mode 0 and cluster-mode 0 (set cover, query AND target coverage), or cov-mode 1 and cluster-mode 2 (CD-hit-like, target coverage). Am I understanding correctly?