soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0
1.36k stars 190 forks source link

-clustering with -similarity-type 1 #414

Open s-devos opened 3 years ago

s-devos commented 3 years ago

Expected Behavior

I am trying to cluster based on similarity (--similarity-type 1) instead of the default identity (--similarity-type 2)

Current Behavior

Using either parameter gives the same clustering output.

Steps to Reproduce (for bugs)

Input: artificial6.txt commands: mmseqs createdb artificial6.fasta in/DB_in mmseqs cluster in/DB_in clu/sim1_clu clu/sim1_tmp --cov-mode 0 -c 0.8 --cluster-mode 0 --min-seq-id 0.9 --mask 0 -k 5 --spaced-kmer-pattern 110111 --similarity-type 1 or mmseqs cluster in/DB_in clu/sim2_clu clu/sim2_tmp --cov-mode 0 -c 0.8 --cluster-mode 0 --min-seq-id 0.9 --mask 0 -k 5 --spaced-kmer-pattern 110111 --similarity-type 2

Both results lead to identical .tsv files

Update: adding --comp-bias-corr 0 gives similar results

MMseqs Output (for bugs)

similarity type 1: Create directory clu/sim1_tmp cluster in/DB_in clu/sim1_clu clu/sim1_tmp --cov-mode 0 -c 0.8 --cluster-mode 0 --min-seq-id 0.9 --mask 0 -k 5 --spaced-kmer-pattern 110111 --similarity-type 1

MMseqs Version: a19f5a526012b849a723935acf56d10f39d24611 Substitution matrix nucl:nucleotide.out,aa:blosum62.out Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out Sensitivity 4 k-mer length 5 k-score 2147483647 Alphabet size nucl:5,aa:21 Max sequence length 65535 Max results per query 20 Split database 0 Split mode 2 Split memory limit 0 Coverage threshold 0.8 Coverage mode 0 Compositional bias 1 Diagonal scoring true Exact k-mer matching 0 Mask residues 0 Mask lower case residues 0 Minimum diagonal score 15 Include identical seq. id. false Spaced k-mers 1 Preload mode 0 Pseudo count a 1 Pseudo count b 1.5 Spaced k-mer pattern 110111 Local temporary path Threads 16 Compressed 0 Verbosity 3 Add backtrace false Alignment mode 3 Allow wrapped scoring false E-value threshold 0.001 Seq. id. threshold 0.9 Min alignment length 0 Seq. id. mode 0 Alternative alignments 0 Max reject 2147483647 Max accept 2147483647 Score bias 0 Realign hits false Realign score bias -0.2 Realign max seqs 2147483647 Gap open cost nucl:5,aa:11 Gap extension cost nucl:2,aa:1 Zdrop 40 Rescore mode 0 Remove hits by seq. id. and coverage false Sort results 0 Cluster mode 0 Max connected component depth 1000 Similarity type 1 Single step clustering false Cascaded clustering steps 3 Cluster reassign false Remove temporary files false Force restart with latest tmp false MPI runner k-mers per sequence 21 Scale k-mers per sequence nucl:0.200,aa:0.000 Adjust k-mer length false Shift hash 67 Include only extendable false Skip repeating k-mers false

Set cluster sensitivity to -s 1.000000 Set cluster iterations to 1 linclust in/DB_in clu/sim1_tmp/7842071913732967198/clu_redundancy clu/sim1_tmp/7842071913732967198/linclust --cluster-mode 0 --max-iterations 1000 --similarity-type 1 --threads 16 --compressed 0 -v 3 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --alph-size nucl:5,aa:13 --kmer-per-seq 21 --spaced-kmer-mode 1 --spaced-kmer-pattern 110111 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 -k 0 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 0 --force-reuse 0

kmermatcher in/DB_in clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.9 --kmer-per-seq 21 --spaced-kmer-mode 1 --spaced-kmer-pattern 110111 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 16 --compressed 0 -v 3

kmermatcher in/DB_in clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.9 --kmer-per-seq 21 --spaced-kmer-mode 1 --spaced-kmer-pattern 110111 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 16 --compressed 0 -v 3

Database size: 164 type: Aminoacid Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)

Generate k-mers list for 1 split [=================================================================] 100.00% 164 0s 52ms Sort kmer 0h 0m 0s 0ms Sort by rep. sequence 0h 0m 0s 0ms Time for fill: 0h 0m 0s 0ms Time for merging to pref: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 170ms rescorediagonal in/DB_in in/DB_in clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/pref clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 164 0s 28ms Time for merging to pref_rescore1: 0h 0m 0s 3ms Time for processing: 0h 0m 0s 84ms clust in/DB_in clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/pref_rescore1 clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 1 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 164 0s 12ms Sort entries Find missing connections Found 10 new connections. Reconstruct initial order [=================================================================] 100.00% 164 0s 12ms Add missing connections [=================================================================] 100.00% 164 0s 2ms

Time for read in: 0h 0m 0s 114ms Total time: 0h 0m 0s 153ms

Size of the sequence database: 164 Size of the alignment database: 164 Number of clusters: 154

Writing results 0h 0m 0s 0ms Time for merging to pre_clust: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 156ms createsubdb clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/order_redundancy in/DB_in clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms createsubdb clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/order_redundancy clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/pref clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/pref_filter1 -v 3 --subdb-mode 1

Time for merging to pref_filter1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms filterdb clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/pref_filter1 clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/pref_filter2 --filter-file clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/order_redundancy --threads 16 --compressed 0 -v 3

Filtering using file(s) [=================================================================] 100.00% 154 0s 34ms Time for merging to pref_filter2: 0h 0m 0s 4ms Time for processing: 0h 0m 0s 99ms rescorediagonal clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/input_step_redundancy clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/input_step_redundancy clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/pref_filter2 clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/pref_rescore2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 154 0s 7ms Time for merging to pref_rescore2: 0h 0m 0s 1ms ] 49.67% 77 eta 0s Time for processing: 0h 0m 0s 37ms align clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/input_step_redundancy clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/input_step_redundancy clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/pref_rescore2 clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 154 type: Aminoacid Target database size: 154 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 154 0s 42ms Time for merging to aln: 0h 0m 0s 4ms 154 alignments calculated 154 sequence pairs passed the thresholds (1.000000 of overall calculated) 1.000000 hits per query sequence Time for processing: 0h 0m 0s 74ms clust clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/input_step_redundancy clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/aln clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/clust --cluster-mode 0 --max-iterations 1000 --similarity-type 1 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 154 0s 12ms Sort entries Find missing connections Found 0 new connections. Reconstruct initial order [=================================================================] 100.00% 154 0s 15ms Add missing connections [=================================================================] 100.00% 154 0s 1ms

Time for read in: 0h 0m 0s 84ms Total time: 0h 0m 0s 114ms

Size of the sequence database: 154 Size of the alignment database: 154 Number of clusters: 154

Writing results 0h 0m 0s 0ms Time for merging to clust: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 119ms mergeclusters in/DB_in clu/sim1_tmp/7842071913732967198/clu_redundancy clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/pre_clust clu/sim1_tmp/7842071913732967198/linclust/14523879757540758787/clust --threads 16 --compressed 0 -v 3

Clustering step 1 [=================================================================] 100.00% 154 0s 10ms Clustering step 2 [=================================================================] 100.00% 154 0s 37ms Write merged clustering [=================================================================] 100.00% 164 0s 48ms Time for merging to clu_redundancy: 0h 0m 0s 1ms Time for processing: 0h 0m 0s 68ms createsubdb clu/sim1_tmp/7842071913732967198/clu_redundancy in/DB_in clu/sim1_tmp/7842071913732967198/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms prefilter clu/sim1_tmp/7842071913732967198/input_step_redundancy clu/sim1_tmp/7842071913732967198/input_step_redundancy clu/sim1_tmp/7842071913732967198/pref_step0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 1 -k 5 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 0 --diag-score 0 --exact-kmer-matching 0 --mask 0 --mask-lower-case 0 --min-ungapped-score 0 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --spaced-kmer-pattern 110111 --threads 16 --compressed 0 -v 3

Query database size: 154 type: Aminoacid Estimated memory consumption: 514M Target database size: 154 type: Aminoacid Index table k-mer threshold: 148 at k-mer size 5 Index table: counting k-mers [=================================================================] 100.00% 154 0s 6ms Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 154 0s 2ms Index statistics Entries: 221 DB size: 24 MB Avg k-mer size: 0.000069 Top 10 k-mers MKCFP 6 KYFPQ 6 HFVRF 4 CPIVP 4 WWWWW 4 RFDHR 3 WWFWW 3 FWWFW 2 WWWFW 2 MEMYY 2 Time for index table init: 0h 0m 0s 117ms Process prefiltering step 1 of 1

k-mer similarity threshold: 148 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 154 Target db start 1 to 154 [=================================================================] 100.00% 154 0s 46ms

1.177440 k-mers per position 3 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 1 sequences passed prefiltering per query sequence 1 median result list length 0 sequences with 0 size result lists Time for merging to pref_step0: 0h 0m 0s 9ms Time for processing: 0h 0m 0s 845ms align clu/sim1_tmp/7842071913732967198/input_step_redundancy clu/sim1_tmp/7842071913732967198/input_step_redundancy clu/sim1_tmp/7842071913732967198/pref_step0 clu/sim1_tmp/7842071913732967198/aln_step0 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 154 type: Aminoacid Target database size: 154 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 154 0s 55ms Time for merging to aln_step0: 0h 0m 0s 5ms 227 alignments calculated 207 sequence pairs passed the thresholds (0.911894 of overall calculated) 1.344156 hits per query sequence Time for processing: 0h 0m 0s 106ms clust clu/sim1_tmp/7842071913732967198/input_step_redundancy clu/sim1_tmp/7842071913732967198/aln_step0 clu/sim1_tmp/7842071913732967198/clu_step0 --cluster-mode 0 --max-iterations 1000 --similarity-type 1 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 154 0s 11ms Sort entries Find missing connections Found 1 new connections. Reconstruct initial order [=================================================================] 100.00% 154 0s 8ms Add missing connections [=================================================================] 100.00% 154 0s 0ms

Time for read in: 0h 0m 0s 84ms Total time: 0h 0m 0s 117ms

Size of the sequence database: 154 Size of the alignment database: 154 Number of clusters: 140

Writing results 0h 0m 0s 0ms Time for merging to clu_step0: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 120ms mergeclusters in/DB_in clu/sim1_clu clu/sim1_tmp/7842071913732967198/clu_redundancy clu/sim1_tmp/7842071913732967198/clu_step0 --threads 16 --compressed 0 -v 3

Clustering step 1 [=================================================================] 100.00% 154 0s 9ms Clustering step 2 [=================================================================] 100.00% 140 0s 43ms Write merged clustering [=================================================================] 100.00% 164 0s 49ms Time for merging to sim1_clu: 0h 0m 0s 1ms Time for processing: 0h 0m 0s 119ms

similarity type 2:

Create directory clu/sim2_tmp cluster in/DB_in clu/sim2_clu clu/sim2_tmp --cov-mode 0 -c 0.8 --cluster-mode 0 --min-seq-id 0.9 --mask 0 -k 5 --spaced-kmer-pattern 110111 --similarity-type 2

MMseqs Version: a19f5a526012b849a723935acf56d10f39d24611 Substitution matrix nucl:nucleotide.out,aa:blosum62.out Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out Sensitivity 4 k-mer length 5 k-score 2147483647 Alphabet size nucl:5,aa:21 Max sequence length 65535 Max results per query 20 Split database 0 Split mode 2 Split memory limit 0 Coverage threshold 0.8 Coverage mode 0 Compositional bias 1 Diagonal scoring true Exact k-mer matching 0 Mask residues 0 Mask lower case residues 0 Minimum diagonal score 15 Include identical seq. id. false Spaced k-mers 1 Preload mode 0 Pseudo count a 1 Pseudo count b 1.5 Spaced k-mer pattern 110111 Local temporary path Threads 16 Compressed 0 Verbosity 3 Add backtrace false Alignment mode 3 Allow wrapped scoring false E-value threshold 0.001 Seq. id. threshold 0.9 Min alignment length 0 Seq. id. mode 0 Alternative alignments 0 Max reject 2147483647 Max accept 2147483647 Score bias 0 Realign hits false Realign score bias -0.2 Realign max seqs 2147483647 Gap open cost nucl:5,aa:11 Gap extension cost nucl:2,aa:1 Zdrop 40 Rescore mode 0 Remove hits by seq. id. and coverage false Sort results 0 Cluster mode 0 Max connected component depth 1000 Similarity type 2 Single step clustering false Cascaded clustering steps 3 Cluster reassign false Remove temporary files false Force restart with latest tmp false MPI runner k-mers per sequence 21 Scale k-mers per sequence nucl:0.200,aa:0.000 Adjust k-mer length false Shift hash 67 Include only extendable false Skip repeating k-mers false

Set cluster sensitivity to -s 1.000000 Set cluster iterations to 1 linclust in/DB_in clu/sim2_tmp/12164117771218227067/clu_redundancy clu/sim2_tmp/12164117771218227067/linclust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --alph-size nucl:5,aa:13 --kmer-per-seq 21 --spaced-kmer-mode 1 --spaced-kmer-pattern 110111 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 -k 0 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 0 --force-reuse 0

kmermatcher in/DB_in clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.9 --kmer-per-seq 21 --spaced-kmer-mode 1 --spaced-kmer-pattern 110111 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 16 --compressed 0 -v 3

kmermatcher in/DB_in clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.9 --kmer-per-seq 21 --spaced-kmer-mode 1 --spaced-kmer-pattern 110111 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 16 --compressed 0 -v 3

Database size: 164 type: Aminoacid Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)

Generate k-mers list for 1 split [=================================================================] 100.00% 164 0s 40ms Sort kmer 0h 0m 0s 0ms Sort by rep. sequence 0h 0m 0s 0ms Time for fill: 0h 0m 0s 0ms Time for merging to pref: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 172ms rescorediagonal in/DB_in in/DB_in clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/pref clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 164 0s 38ms Time for merging to pref_rescore1: 0h 0m 0s 3ms ] 25.77% 43 eta 0s Time for processing: 0h 0m 0s 102ms clust in/DB_in clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/pref_rescore1 clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 164 0s 7ms Sort entries Find missing connections Found 10 new connections. Reconstruct initial order [=================================================================] 100.00% 164 0s 6ms Add missing connections [=================================================================] 100.00% 164 0s 1ms

Time for read in: 0h 0m 0s 92ms Total time: 0h 0m 0s 156ms

Size of the sequence database: 164 Size of the alignment database: 164 Number of clusters: 154

Writing results 0h 0m 0s 0ms Time for merging to pre_clust: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 159ms createsubdb clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/order_redundancy in/DB_in clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms createsubdb clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/order_redundancy clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/pref clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/pref_filter1 -v 3 --subdb-mode 1

Time for merging to pref_filter1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms filterdb clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/pref_filter1 clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/pref_filter2 --filter-file clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/order_redundancy --threads 16 --compressed 0 -v 3

Filtering using file(s) [=================================================================] 100.00% 154 0s 13ms Time for merging to pref_filter2: 0h 0m 0s 2ms Time for processing: 0h 0m 0s 79ms rescorediagonal clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/input_step_redundancy clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/input_step_redundancy clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/pref_filter2 clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/pref_rescore2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 154 0s 22ms Time for merging to pref_rescore2: 0h 0m 0s 3ms ] 49.02% 76 eta 0s Time for processing: 0h 0m 0s 76ms align clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/input_step_redundancy clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/input_step_redundancy clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/pref_rescore2 clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 154 type: Aminoacid Target database size: 154 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 154 0s 34ms Time for merging to aln: 0h 0m 0s 2ms 154 alignments calculated 154 sequence pairs passed the thresholds (1.000000 of overall calculated) 1.000000 hits per query sequence Time for processing: 0h 0m 0s 142ms clust clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/input_step_redundancy clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/aln clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 154 0s 8ms Sort entries Find missing connections Found 0 new connections. Reconstruct initial order [=================================================================] 100.00% 154 0s 9ms Add missing connections [=================================================================] 100.00% 154 0s 0ms

Time for read in: 0h 0m 0s 84ms Total time: 0h 0m 0s 125ms

Size of the sequence database: 154 Size of the alignment database: 154 Number of clusters: 154

Writing results 0h 0m 0s 0ms Time for merging to clust: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 127ms mergeclusters in/DB_in clu/sim2_tmp/12164117771218227067/clu_redundancy clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/pre_clust clu/sim2_tmp/12164117771218227067/linclust/3267328275516617562/clust --threads 16 --compressed 0 -v 3

Clustering step 1 [=================================================================] 100.00% 154 0s 5ms Clustering step 2 [=================================================================] 100.00% 154 0s 28ms Write merged clustering [=================================================================] 100.00% 164 0s 38ms Time for merging to clu_redundancy: 0h 0m 0s 2ms Time for processing: 0h 0m 0s 72ms createsubdb clu/sim2_tmp/12164117771218227067/clu_redundancy in/DB_in clu/sim2_tmp/12164117771218227067/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms prefilter clu/sim2_tmp/12164117771218227067/input_step_redundancy clu/sim2_tmp/12164117771218227067/input_step_redundancy clu/sim2_tmp/12164117771218227067/pref_step0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 1 -k 5 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 0 --diag-score 0 --exact-kmer-matching 0 --mask 0 --mask-lower-case 0 --min-ungapped-score 0 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --spaced-kmer-pattern 110111 --threads 16 --compressed 0 -v 3

Query database size: 154 type: Aminoacid Estimated memory consumption: 514M Target database size: 154 type: Aminoacid Index table k-mer threshold: 148 at k-mer size 5 Index table: counting k-mers [=================================================================] 100.00% 154 0s 14ms Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 154 0s 4ms Index statistics Entries: 221 DB size: 24 MB Avg k-mer size: 0.000069 Top 10 k-mers MKCFP 6 KYFPQ 6 HFVRF 4 CPIVP 4 WWWWW 4 RFDHR 3 WWFWW 3 FWWFW 2 WWWFW 2 MEMYY 2 Time for index table init: 0h 0m 0s 118ms Process prefiltering step 1 of 1

k-mer similarity threshold: 148 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 154 Target db start 1 to 154 [=================================================================] 100.00% 154 0s 23ms [==========================================================> ] 89.54% 138 eta 0s 1.177440 k-mers per position 3 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 1 sequences passed prefiltering per query sequence 1 median result list length 0 sequences with 0 size result lists Time for merging to pref_step0: 0h 0m 0s 4ms Time for processing: 0h 0m 0s 795ms align clu/sim2_tmp/12164117771218227067/input_step_redundancy clu/sim2_tmp/12164117771218227067/input_step_redundancy clu/sim2_tmp/12164117771218227067/pref_step0 clu/sim2_tmp/12164117771218227067/aln_step0 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 154 type: Aminoacid Target database size: 154 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 154 0s 46ms Time for merging to aln_step0: 0h 0m 0s 4ms 227 alignments calculated 207 sequence pairs passed the thresholds (0.911894 of overall calculated) 1.344156 hits per query sequence Time for processing: 0h 0m 0s 72ms clust clu/sim2_tmp/12164117771218227067/input_step_redundancy clu/sim2_tmp/12164117771218227067/aln_step0 clu/sim2_tmp/12164117771218227067/clu_step0 --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 154 0s 8ms Sort entries Find missing connections Found 1 new connections. Reconstruct initial order [=================================================================] 100.00% 154 0s 8ms Add missing connections [=================================================================] 100.00% 154 0s 0ms

Time for read in: 0h 0m 0s 85ms Total time: 0h 0m 0s 161ms

Size of the sequence database: 154 Size of the alignment database: 154 Number of clusters: 140

Writing results 0h 0m 0s 0ms Time for merging to clu_step0: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 165ms mergeclusters in/DB_in clu/sim2_clu clu/sim2_tmp/12164117771218227067/clu_redundancy clu/sim2_tmp/12164117771218227067/clu_step0 --threads 16 --compressed 0 -v 3

Clustering step 1 [=================================================================] 100.00% 154 0s 39ms Clustering step 2 [=================================================================] 100.00% 140 0s 132ms Write merged clustering [=================================================================] 100.00% 164 0s 195ms Time for merging to sim2_clu: 0h 0m 0s 3ms Time for processing: 0h 0m 0s 253ms

Context

Is combining --min-seq-id with --similarity type 1 the correct approach to define thresholds for clustering? From what I read in the guidelines, I expect this measure to be automatically converted.

milot-mirdita commented 3 years ago

Your sequences might be just too short for this setting to have any effect. Everything looks alright. You don't have to specify this parameter usually ever. --min-seq-id is taken into account regardless.

s-devos commented 3 years ago

Then how is the threshold defined when no parameter is provided?

I'm not sure how the length of the sequence could be the problem here. In the artificial I included handcrafted sequences that are just above the identity threshold (E.g. 21 AA long, edit distance 2 with 90% id) with mild and substantial amino acid changes (Ala > Gly vs. Ala > Cys). These are designed specifically to be clustered together only when considering identity, and not when considering similarity.

milot-mirdita commented 3 years ago

So from what I understand from (re-)reading the code, this parameter only affects the set-cover clustering algorithm and there it allows to switch to picking the cluster representative either by the highest sequence identity or alternatively by alignment score.

This is completely independent fron the alignment stage, where various parameters (e-value, min-seq-id, coverage, min-aln-length, max-accept/reject) decide which hits can get edges in-between them.

s-devos commented 3 years ago

That is interesting. In that case, I mistakingly drew lines between this option and the explanation on sequence identity computations in the guidelines (Header "How does MMseqs2 compute the sequence identity"). But now I understand that these distance metrics are hardcoded based on which coverage mode is selected.

"... and there it allows to switch to picking the cluster representative either by the highest sequence identity or alternatively by alignment score."

I am not sure how that works. the highest sequence identity or alignment score compared to what? Since set cover determines the representatives by the number of edges, is the sequence identity/alignment score for ties?

martin-steinegger commented 3 years ago

Setcover reassigns sequences after the it determined the representative sequences. It does this by checking for each sequence who is the best representative sequence based on sequence identity or score (--similarity-type).