Closed knishiura3 closed 2 years ago
Can you try clustering with the --cluster-reassign
Parameter?
due to the cascaded cluster algorithm, clustern can end up merged that have members below the clustering thresholds. This parameter fixes the clustering afterwards.
the paarwise alignments among cluster members that alignall produces can be much more dissimilar though.
Thanks for the quick response.
I've tried clustering again with --cluster-reassign 1, but am still getting alignment lengths that appear to violate the clustering criteria, unless I'm misunderstanding something:
mmseqs cluster protfamDB clusterDB_reassign ~/scratch/tmp -s 6 --threads 24 --min-seq-id 0.4 -c 0.5 --cov-mode 2 --cluster-mode 0 --cluster-reassign 1
mmseqs align protfamDB protfamDB clusterDB_reassign alnDB_reassign -a --threads 24 --comp-bias-corr 0 -e 9007199254740992
mmseqs convertalis protfamDB protfamDB alnDB_reassign aln_to_clustreps.m8
awk '$4<50{print}' aln_to_clustreps.m8 | sort -nk4,4 | head
45656_3631 1137799_3214 0.950 20 1 0 79 98 423 442 2.607E-05 51
45656_3631 217204_6534 0.809 21 4 0 79 99 695 715 1.266E-04 49
45656_3631 2571105_3757 0.761 21 5 0 79 99 695 715 3.266E-04 48
45656_3631 668_9026 0.761 21 5 0 79 99 714 734 1.155E-03 46
489_1270 489_2691 1.000 21 0 0 1 21 1 21 3.857E-05 50
33968_38 33968_2958 0.833 24 4 0 92 115 128 151 2.956E-04 48
1550024_4392 1608583_1648 0.720 25 7 0 6 30 7 31 2.660E-04 48
2527978_7844 2527978_861 0.920 25 2 0 1 25 1 25 1.690E-05 52
45656_3631 191863_3100 0.880 25 3 0 75 99 282 306 8.777E-08 58
1128665_3036 709876_1980 0.653 26 9 0 43 68 211 236 2.419E-03 45
For example, in the first row of output above, the alignment length is 20 residues for a query sequence that is 100 residues long. My clustering parameters (-c 0.5 --cov-mode 2) should prevent a target sequence w/ alignment length <50 residues from joining this cluster, right?
On a related note, I noted this line in my clustering output log:
Combining cluster mode 0 in combination with coverage mode 2 can produce wrong results.
Please use --cov-mode 2
And found this comment from a previous issue:
The cluster mode 0 (set cover) requires a bi-directional coverage criteria. If it is used with unidirectional coverage than it pick representative sequences that do not fulfill the clustering criteria anymore.
Originally posted by @martin-steinegger in
https://github.com/soedinglab/MMseqs2/issues/218#issuecomment-580978693
It seems like my choice of parameters is at fault, and that I would probably have clusters that more closely obey the criteria if I used cov-mode 0 and cluster-mode 0 (set cover, query AND target coverage), or cov-mode 1 and cluster-mode 2 (CD-hit-like, target coverage). Am I understanding correctly?
Expected Behavior
Workflow: cluster -> alignall -> createtsv
I noticed there are cluster members in my output with very short alignment lengths. I was expecting the aligned length to be at least 50% of the cluster representative length, but that appears not to be the case.
Current Behavior
With the evalue cutoff for alignall set to a very large number, I'm getting really short alignment lengths that make me confused about how these cluster members were assigned to the cluster representative in the first place. For example on the first line below, the aligned length of query and target for the cluster representative 1288495_4011 and cluster member 758793_4580 appear to be 4 amino acids, which is much shorter than 0.5*65 = 32.5
(showing only 5 lines for brevity)
Steps to Reproduce (for bugs)
Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.
Your Environment
Include as many relevant details about the environment you experienced the bug in.