s-devos commented 3 years ago

See the reference to the older, closed issue below. Was this bug actually fixed, apart from the change in how the command is used? I still get the same error when using --cluster-reassign:

Originally posted by @s-devos in https://github.com/soedinglab/MMseqs2/issues/329#issuecomment-771475381

Input: mmseqs cluster DB_in DB_clustered tmp/ --cluster-reassign --cluster-mode 1 --cov-mode 0

Error:
Found 100 new connections.
Reconstruct initial order
Alignment format is not supported! ] 0.00% 1 eta - Alignment format is not supported!
Alignment format is not supported!
Error: Clustering step 2 died

Further inspection shows that --cluster-reassign causes problems with every cascaded clustering option I tried; it only works together with --single-step-clustering, which isn't very useful.
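For anyone who wants to check the scope of the failure, the combination reported as working can be sketched as below. This is a dry-run sketch, not a fix: the database and tmp names are the placeholders from the command above, the flag spelling follows the invocation in this thread, and `RUN_MMSEQS` is a hypothetical switch introduced only for this example.

```shell
#!/bin/sh
# Sketch: pair --cluster-reassign with --single-step-clustering, the one
# combination reported as working in this thread.
DB_IN="DB_in"            # placeholder names taken from the original report
DB_OUT="DB_clustered"
TMP_DIR="tmp"

single_step_cmd="mmseqs cluster $DB_IN $DB_OUT $TMP_DIR --single-step-clustering --cluster-reassign --cluster-mode 1 --cov-mode 0"

# Always print the command so it can be inspected before running it.
echo "$single_step_cmd"

# RUN_MMSEQS is a hypothetical opt-in switch; nothing runs unless it is set
# to 1 and the mmseqs binary is on PATH.
if [ "${RUN_MMSEQS:-0}" = "1" ] && command -v mmseqs >/dev/null 2>&1; then
    mkdir -p "$TMP_DIR"
    $single_step_cmd
fi
```

Run it as-is to print the command, or with `RUN_MMSEQS=1` to execute it against your own databases.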

martin-steinegger commented 3 years ago

@s-devos could you please provide the log of the clustering?

s-devos commented 3 years ago

Tmp tmpfiles/ folder does not exist or is not a directory.
Create dir tmpfiles/
cluster DB_in/fasta_in_db_HC DB_clustered/clustered_HC tmpfiles/ --cluster-reassign 1 --cluster-mode 1 --cov-mode 0

MMseqs Version: 96d452cb432fc4674991a48952deaf24d1787e77
Substitution matrix nucl:nucleotide.out,aa:blosum62.out
Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out
Sensitivity 4
k-mer length 0
k-score 2147483647
Alphabet size nucl:5,aa:21
Max sequence length 65535
Max results per query 20
Split database 0
Split mode 2
Split memory limit 0
Coverage threshold 0.8
Coverage mode 0
Compositional bias 1
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask lower case residues 0
Minimum diagonal score 15
Include identical seq. id. false
Spaced k-mers 1
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Spaced k-mer pattern
Local temporary path
Threads 16
Compressed 0
Verbosity 3
Add backtrace false
Alignment mode 3
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Realign hits false
Max reject 2147483647
Max accept 2147483647
Score bias 0
Gap open cost nucl:5,aa:11
Gap extension cost nucl:2,aa:1
Zdrop 40
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Cluster mode 1
Max connected component depth 1000
Similarity type 2
Single step clustering false
Cascaded clustering steps 3
Cluster reassign true
Remove temporary files false
Force restart with latest tmp false
MPI runner
k-mers per sequence 21
Scale k-mers per sequence nucl:0.200,aa:0.000
Adjust k-mer length false
Shift hash 67
Include only extendable false
Skip repeating k-mers false

Set cluster sensitivity to -s 6.000000
Connected component clustering produces less clusters in a single step clustering.
Please use --single-step-cluster
Set cluster iterations to 3
linclust DB_in/fasta_in_db_HC tmpfiles//13298481167543164943/clu_redundancy tmpfiles//13298481167543164943/linclust --cluster-mode 1 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --alph-size nucl:5,aa:13 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 -k 0 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 0 --force-reuse 0

kmermatcher DB_in/fasta_in_db_HC tmpfiles//13298481167543164943/linclust/10229649346622198404/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 16 --compressed 0 -v 3

Database size: 303 type: Aminoacid Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)

Generate k-mers list for 1 split [=================================================================] 303 0s 51ms Sort kmer 0h 0m 0s 3ms Sort by rep. sequence 0h 0m 0s 0ms Time for fill: 0h 0m 0s 0ms Time for merging to pref: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 182ms rescorediagonal DB_in/fasta_in_db_HC DB_in/fasta_in_db_HC tmpfiles//13298481167543164943/linclust/10229649346622198404/pref tmpfiles//13298481167543164943/linclust/10229649346622198404/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 16 --compressed 0 -v 3

[=================================================================] 303 0s 14ms Time for merging to pref_rescore1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 46ms clust DB_in/fasta_in_db_HC tmpfiles//13298481167543164943/linclust/10229649346622198404/pref_rescore1 tmpfiles//13298481167543164943/linclust/10229649346622198404/pre_clust --cluster-mode 1 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Connected Component [=================================================================] 303 0s 0ms Sort entries Find missing connections Found 28 new connections. Reconstruct initial order [=================================================================] 303 0s 6ms Add missing connections [=================================================================] 303 0s 0ms

Time for read in: 0h 0m 0s 66ms connected component mode Total time: 0h 0m 0s 93ms

Size of the sequence database: 303 Size of the alignment database: 303 Number of clusters: 276

Writing results 0h 0m 0s 0ms Time for merging to pre_clust: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 95ms createsubdb tmpfiles//13298481167543164943/linclust/10229649346622198404/order_redundancy DB_in/fasta_in_db_HC tmpfiles//13298481167543164943/linclust/10229649346622198404/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms createsubdb tmpfiles//13298481167543164943/linclust/10229649346622198404/order_redundancy tmpfiles//13298481167543164943/linclust/10229649346622198404/pref tmpfiles//13298481167543164943/linclust/10229649346622198404/pref_filter1 -v 3 --subdb-mode 1

Time for merging to pref_filter1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms filterdb tmpfiles//13298481167543164943/linclust/10229649346622198404/pref_filter1 tmpfiles//13298481167543164943/linclust/10229649346622198404/pref_filter2 --filter-file tmpfiles//13298481167543164943/linclust/10229649346622198404/order_redundancy --threads 16 --compressed 0 -v 3

Filtering using file(s) [=================================================================] 276 0s 15ms Time for merging to pref_filter2: 0h 0m 0s 7ms Time for processing: 0h 0m 0s 31ms rescorediagonal tmpfiles//13298481167543164943/linclust/10229649346622198404/input_step_redundancy tmpfiles//13298481167543164943/linclust/10229649346622198404/input_step_redundancy tmpfiles//13298481167543164943/linclust/10229649346622198404/pref_filter2 tmpfiles//13298481167543164943/linclust/10229649346622198404/pref_rescore2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 16 --compressed 0 -v 3

Can not find any score per column for coverage 0.800000 and sequence identity 0.000000. No hit will be filtered. [=================================================================] 276 0s 20ms Time for merging to pref_rescore2: 0h 0m 0s 3ms Time for processing: 0h 0m 0s 52ms align tmpfiles//13298481167543164943/linclust/10229649346622198404/input_step_redundancy tmpfiles//13298481167543164943/linclust/10229649346622198404/input_step_redundancy tmpfiles//13298481167543164943/linclust/10229649346622198404/pref_rescore2 tmpfiles//13298481167543164943/linclust/10229649346622198404/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 276 type: Aminoacid Target database size: 276 type: Aminoacid Calculation of alignments [=================================================================] 276 0s 26ms Time for merging to aln: 0h 0m 0s 1ms

276 alignments calculated. 276 sequence pairs passed the thresholds (1.000000 of overall calculated). 1.000000 hits per query sequence. Time for processing: 0h 0m 0s 82ms clust tmpfiles//13298481167543164943/linclust/10229649346622198404/input_step_redundancy tmpfiles//13298481167543164943/linclust/10229649346622198404/aln tmpfiles//13298481167543164943/linclust/10229649346622198404/clust --cluster-mode 1 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Connected Component [=================================================================] 276 0s 0ms Sort entries Find missing connections Found 0 new connections. Reconstruct initial order [=================================================================] 276 0s 0ms Add missing connections [=================================================================] 276 0s 0ms

Time for read in: 0h 0m 0s 5ms connected component mode Total time: 0h 0m 0s 15ms

Size of the sequence database: 276 Size of the alignment database: 276 Number of clusters: 276

Writing results 0h 0m 0s 1ms Time for merging to clust: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 19ms mergeclusters DB_in/fasta_in_db_HC tmpfiles//13298481167543164943/clu_redundancy tmpfiles//13298481167543164943/linclust/10229649346622198404/pre_clust tmpfiles//13298481167543164943/linclust/10229649346622198404/clust --threads 16 --compressed 0 -v 3

Clustering step 1 [=================================================================] 276 0s 14ms Clustering step 2 [=================================================================] 276 0s 37ms Write merged clustering [=================================================================] 303 0s 48ms Time for merging to clu_redundancy: 0h 0m 0s 1ms Time for processing: 0h 0m 0s 102ms createsubdb tmpfiles//13298481167543164943/clu_redundancy DB_in/fasta_in_db_HC tmpfiles//13298481167543164943/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms prefilter tmpfiles//13298481167543164943/input_step_redundancy tmpfiles//13298481167543164943/input_step_redundancy tmpfiles//13298481167543164943/pref_step0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 1 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 0 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 0 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 276 type: Aminoacid Estimated memory consumption: 978M Target database size: 276 type: Aminoacid Index table k-mer threshold: 154 at k-mer size 6 Index table: counting k-mers [=================================================================] 276 0s 28ms Index table: Masked residues: 0 Index table: fill [=================================================================] 276 0s 5ms Index statistics Entries: 1187 DB size: 488 MB Avg k-mer size: 0.000019 Top 10 k-mers XXXXXX 7 XXXXXX 4 XXXXXX 4 XXXXXX 4 XXXXXX 4 XXXXXX 3 XXXXXX 3 XXXXXX 3 XXXXXX 3 XXXXXX 3 Time for index table init: 0h 0m 1s 36ms Process prefiltering step 1 of 1

k-mer similarity threshold: 154 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 276 Target db start 1 to 276 [================================================================] =276 0s 28ms

1.374916 k-mers per position 5 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 1 sequences passed prefiltering per query sequence 1 median result list length 0 sequences with 0 size result lists Time for merging to pref_step0: 0h 0m 0s 5ms Time for processing: 0h 0m 1s 852ms align tmpfiles//13298481167543164943/input_step_redundancy tmpfiles//13298481167543164943/input_step_redundancy tmpfiles//13298481167543164943/pref_step0 tmpfiles//13298481167543164943/aln_step0 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 276 type: Aminoacid Target database size: 276 type: Aminoacid Calculation of alignments [=================================================================] 276 0s 39ms Time for merging to aln_step0: 0h 0m 0s 2ms

413 alignments calculated. 406 sequence pairs passed the thresholds (0.983051 of overall calculated). 1.471014 hits per query sequence. Time for processing: 0h 0m 0s 82ms clust tmpfiles//13298481167543164943/input_step_redundancy tmpfiles//13298481167543164943/aln_step0 tmpfiles//13298481167543164943/clu_step0 --cluster-mode 1 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Connected Component [=================================================================] 276 0s 6ms Sort entries Find missing connections Found 0 new connections. Reconstruct initial order [=================================================================] 276 0s 3ms Add missing connections [=================================================================] 276 0s 0ms

Time for read in: 0h 0m 0s 66ms connected component mode Total time: 0h 0m 0s 90ms

Size of the sequence database: 276 Size of the alignment database: 276 Number of clusters: 237

Writing results 0h 0m 0s 0ms Time for merging to clu_step0: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 96ms createsubdb tmpfiles//13298481167543164943/clu_step0 tmpfiles//13298481167543164943/input_step_redundancy tmpfiles//13298481167543164943/input_step1 -v 3 --subdb-mode 1

Time for merging to input_step1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms prefilter tmpfiles//13298481167543164943/input_step1 tmpfiles//13298481167543164943/input_step1 tmpfiles//13298481167543164943/pref_step1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 3.5 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 237 type: Aminoacid Estimated memory consumption: 977M Target database size: 237 type: Aminoacid Index table k-mer threshold: 131 at k-mer size 6 Index table: counting k-mers [=================================================================] 237 0s 39ms Index table: Masked residues: 0 Index table: fill [=================================================================] 237 0s 9ms Index statistics Entries: 1403 DB size: 488 MB Avg k-mer size: 0.000022 Top 10 k-mers XXXXXX 3 XXXXXX 3 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 Time for index table init: 0h 0m 1s 25ms Process prefiltering step 1 of 1

k-mer similarity threshold: 131 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 237 Target db start 1 to 237 [=================================================================] 237 0s 19ms

20.483280 k-mers per position 6 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 1 sequences passed prefiltering per query sequence 1 median result list length 0 sequences with 0 size result lists Time for merging to pref_step1: 0h 0m 0s 4ms Time for processing: 0h 0m 1s 707ms align tmpfiles//13298481167543164943/input_step1 tmpfiles//13298481167543164943/input_step1 tmpfiles//13298481167543164943/pref_step1 tmpfiles//13298481167543164943/aln_step1 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 237 type: Aminoacid Target database size: 237 type: Aminoacid Calculation of alignments [=================================================================] 237 0s 50ms Time for merging to aln_step1: 0h 0m 0s 3ms

306 alignments calculated. 271 sequence pairs passed the thresholds (0.885621 of overall calculated). 1.143460 hits per query sequence. Time for processing: 0h 0m 0s 99ms clust tmpfiles//13298481167543164943/input_step1 tmpfiles//13298481167543164943/aln_step1 tmpfiles//13298481167543164943/clu_step1 --cluster-mode 1 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Connected Component [=================================================================] 237 0s 4ms Sort entries Find missing connections Found 10 new connections. Reconstruct initial order [=================================================================] 237 0s 3ms Add missing connections [=================================================================] 237 0s 0ms

Time for read in: 0h 0m 0s 60ms connected component mode Total time: 0h 0m 0s 84ms

Size of the sequence database: 237 Size of the alignment database: 237 Number of clusters: 218

Writing results 0h 0m 0s 2ms Time for merging to clu_step1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 89ms createsubdb tmpfiles//13298481167543164943/clu_step1 tmpfiles//13298481167543164943/input_step1 tmpfiles//13298481167543164943/input_step2 -v 3 --subdb-mode 1

Time for merging to input_step2: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms prefilter tmpfiles//13298481167543164943/input_step2 tmpfiles//13298481167543164943/input_step2 tmpfiles//13298481167543164943/pref_step2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 6 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 218 type: Aminoacid Estimated memory consumption: 977M Target database size: 218 type: Aminoacid Index table k-mer threshold: 109 at k-mer size 6 Index table: counting k-mers [=================================================================] 218 0s 31ms Index table: Masked residues: 0 Index table: fill [=================================================================] 218 0s 4ms Index statistics Entries: 1318 DB size: 488 MB Avg k-mer size: 0.000021 Top 10 k-mers XXXXXX 3 XXXXXX 3 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 Time for index table init: 0h 0m 0s 994ms Process prefiltering step 1 of 1

k-mer similarity threshold: 109 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 218 Target db start 1 to 218 [=================================================================] 218 0s 55ms

193.314206 k-mers per position 8 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 1 sequences passed prefiltering per query sequence 1 median result list length 0 sequences with 0 size result lists Time for merging to pref_step2: 0h 0m 0s 4ms Time for processing: 0h 0m 1s 763ms align tmpfiles//13298481167543164943/input_step2 tmpfiles//13298481167543164943/input_step2 tmpfiles//13298481167543164943/pref_step2 tmpfiles//13298481167543164943/aln_step2 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 218 type: Aminoacid Target database size: 218 type: Aminoacid Calculation of alignments [=================================================================] 218 0s 82ms Time for merging to aln_step2: 0h 0m 0s 2ms

358 alignments calculated. 247 sequence pairs passed the thresholds (0.689944 of overall calculated). 1.133028 hits per query sequence. Time for processing: 0h 0m 0s 123ms clust tmpfiles//13298481167543164943/input_step2 tmpfiles//13298481167543164943/aln_step2 tmpfiles//13298481167543164943/clu_step2 --cluster-mode 1 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Connected Component [=================================================================] 218 0s 9ms Sort entries Find missing connections Found 7 new connections. Reconstruct initial order [=================================================================] 218 0s 3ms Add missing connections [=================================================================] 218 0s 0ms

Time for read in: 0h 0m 0s 77ms connected component mode Total time: 0h 0m 0s 102ms

Size of the sequence database: 218 Size of the alignment database: 218 Number of clusters: 200

Writing results 0h 0m 0s 0ms Time for merging to clu_step2: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 110ms mergeclusters DB_in/fasta_in_db_HC tmpfiles//13298481167543164943/clu tmpfiles//13298481167543164943/clu_redundancy tmpfiles//13298481167543164943/clu_step0 tmpfiles//13298481167543164943/clu_step1 tmpfiles//13298481167543164943/clu_step2

Clustering step 1 [=================================================================] 276 0s 7ms Clustering step 2 [=================================================================] 237 0s 23ms Clustering step 3 [=================================================================] 218 0s 41ms Clustering step 4 [=================================================================] 200 0s 62ms Write merged clustering [=================================================================] 303 0s 66ms Time for merging to clu: 0h 0m 0s 3ms Time for processing: 0h 0m 0s 112ms align DB_in/fasta_in_db_HC DB_in/fasta_in_db_HC tmpfiles//13298481167543164943/clu tmpfiles//13298481167543164943/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 303 type: Aminoacid Target database size: 303 type: Aminoacid Calculation of alignments [=================================================================] 200 0s 16ms Time for merging to aln: 0h 0m 0s 3ms

303 alignments calculated. 293 sequence pairs passed the thresholds (0.966997 of overall calculated). 1.465000 hits per query sequence. Time for processing: 0h 0m 0s 71ms subtractdbs tmpfiles//13298481167543164943/clu tmpfiles//13298481167543164943/aln tmpfiles//13298481167543164943/clu_not_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

subtractdbs tmpfiles//13298481167543164943/clu tmpfiles//13298481167543164943/aln tmpfiles//13298481167543164943/clu_not_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

Remove tmpfiles//13298481167543164943/aln ids from tmpfiles//13298481167543164943/clu [=================================================================] 200 0s 9ms Time for merging to clu_not_accepted: 0h 0m 0s 6ms Time for processing: 0h 0m 0s 29ms subtractdbs tmpfiles//13298481167543164943/clu tmpfiles//13298481167543164943/clu_not_accepted tmpfiles//13298481167543164943/clu_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

subtractdbs tmpfiles//13298481167543164943/clu tmpfiles//13298481167543164943/clu_not_accepted tmpfiles//13298481167543164943/clu_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

Remove tmpfiles//13298481167543164943/clu_not_accepted ids from tmpfiles//13298481167543164943/clu [=================================================================] 200 0s 8ms Time for merging to clu_accepted: 0h 0m 0s 9ms Time for processing: 0h 0m 0s 22ms swapdb tmpfiles//13298481167543164943/clu_not_accepted tmpfiles//13298481167543164943/clu_not_accepted_swap --threads 16 --compressed 0 -v 3

[=================================================================] 200 0s 3ms Computing offsets. [=================================================================] 200 0s 7ms

Reading results. [=================================================================] 200 0s 5ms

Output database: tmpfiles//13298481167543164943/clu_not_accepted_swap [=================================================================] 284 0s 9ms

Time for merging to clu_not_accepted_swap: 0h 0m 0s 1ms Time for processing: 0h 0m 0s 70ms createsubdb tmpfiles//13298481167543164943/clu_not_accepted_swap DB_in/fasta_in_db_HC tmpfiles//13298481167543164943/seq_wrong_assigned -v 3

Time for merging to seq_wrong_assigned: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms createsubdb tmpfiles//13298481167543164943/clu DB_in/fasta_in_db_HC tmpfiles//13298481167543164943/seq_seeds -v 3

Time for merging to seq_seeds: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms prefilter tmpfiles//13298481167543164943/seq_wrong_assigned tmpfiles//13298481167543164943/seq_seeds.merged tmpfiles//13298481167543164943/seq_wrong_assigned_pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 6 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 10 type: Aminoacid Estimated memory consumption: 977M Target database size: 210 type: Aminoacid Index table k-mer threshold: 109 at k-mer size 6 Index table: counting k-mers [=================================================================] 210 0s 30ms Index table: Masked residues: 0 Index table: fill [=================================================================] 210 0s 6ms Index statistics Entries: 1284 DB size: 488 MB Avg k-mer size: 0.000020 Top 10 k-mers XXXXXX 3 XXXXXX 3 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 Time for index table init: 0h 0m 0s 989ms Process prefiltering step 1 of 1

k-mer similarity threshold: 109 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 10 Target db start 1 to 210 [=================================================================] 10 0s 14ms

487.061439 k-mers per position 14 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 3 sequences passed prefiltering per query sequence 3 median result list length 0 sequences with 0 size result lists Time for merging to seq_wrong_assigned_pref: 0h 0m 0s 1ms Time for processing: 0h 0m 1s 609ms swapdb tmpfiles//13298481167543164943/seq_wrong_assigned_pref tmpfiles//13298481167543164943/seq_wrong_assigned_pref_swaped --threads 16 --compressed 0 -v 3

[=================================================================] 10 0s 5ms Computing offsets. [=================================================================] 10 0s 2ms

Reading results. [=================================================================] 10 0s 5ms

Output database: tmpfiles//13298481167543164943/seq_wrong_assigned_pref_swaped [=================================================================] 297 0s 5ms

Time for merging to seq_wrong_assigned_pref_swaped: 0h 0m 0s 1ms Time for processing: 0h 0m 0s 73ms align tmpfiles//13298481167543164943/seq_seeds.merged tmpfiles//13298481167543164943/seq_wrong_assigned tmpfiles//13298481167543164943/seq_wrong_assigned_pref_swaped tmpfiles//13298481167543164943/seq_wrong_assigned_pref_swaped_aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 210 type: Aminoacid Target database size: 10 type: Aminoacid Calculation of alignments [=================================================================] 22 0s 31ms Time for merging to seq_wrong_assigned_pref_swaped_aln: 0h 0m 0s 0ms

29 alignments calculated. 21 sequence pairs passed the thresholds (0.724138 of overall calculated). 0.954545 hits per query sequence. Time for processing: 0h 0m 0s 80ms filterdb tmpfiles//13298481167543164943/seq_wrong_assigned_pref_swaped_aln tmpfiles//13298481167543164943/seq_wrong_assigned_pref_swaped_aln_ocol --trim-to-one-column --threads 16 --compressed 0 -v 3

Filtering using regular expression [=================================================================] 22 0s 20ms Time for merging to seq_wrong_assigned_pref_swaped_aln_ocol: 0h 0m 0s 2ms Time for processing: 0h 0m 0s 57ms mergedbs tmpfiles//13298481167543164943/seq_seeds.merged tmpfiles//13298481167543164943/clu_accepted_plus_wrong tmpfiles//13298481167543164943/clu_accepted tmpfiles//13298481167543164943/seq_wrong_assigned_pref_swaped_aln_ocol --compressed 0 -v 3

Merging the results to tmpfiles//13298481167543164943/clu_accepted_plus_wrong [=================================================================] 210 0s 0ms Time for merging to clu_accepted_plus_wrong: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms tsv2db tmpfiles//13298481167543164943/missing.single.seqs tmpfiles//13298481167543164943/missing.single.seqs.db --output-dbtype 6 --compressed 0 -v 3

Output database type: Clustering Time for merging to missing.single.seqs.db: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms mergedbs DB_in/fasta_in_db_HC tmpfiles//13298481167543164943/clu_accepted_plus_wrong_plus_single tmpfiles//13298481167543164943/clu_accepted_plus_wrong tmpfiles//13298481167543164943/missing.single.seqs.db --compressed 0 -v 3

Merging the results to tmpfiles//13298481167543164943/clu_accepted_plus_wrong_plus_single [=================================================================] 303 0s 0ms Time for merging to clu_accepted_plus_wrong_plus_single: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms clust DB_in/fasta_in_db_HC tmpfiles//13298481167543164943/clu_accepted_plus_wrong_plus_single DB_clustered/clustered_HC --cluster-mode 1 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Connected Component [=================================================================Alignment format is not supported! Alignment format is not supported! Alignment format is not supported! ] 303 0s 0ms Sort entries Find missing connections Found 100 new connections. Reconstruct initial order [Error: Clustering step 2 died

s-devos commented 3 years ago

With other clustering modes, I run into the same errors as mentioned earlier in issue #374.

Together with the other issues I am running into, this gives me the strong impression that the short length of my sequences (7-20 AA) is the problem. Is MMseqs2 suited to handle such data?

milot-mirdita commented 3 years ago

It is, but not with the default parameters. Take a look at issue #373 for parameters that might be useful (especially a shorter k-mer size and a shorter spaced seed pattern).
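A rough way to see why seed length matters for peptides this short is to count how many positions a seed can occupy in a sequence. The sketch below is plain arithmetic, not MMseqs2 code: `window_count` is a hypothetical helper, and the span of 10 for a spaced seed is an illustrative assumption rather than a value taken from the logs (the logs only show that k-mer size 6 was selected).

```python
def window_count(seq_len: int, span: int) -> int:
    """Number of start positions where a seed covering `span` residues fits."""
    return max(0, seq_len - span + 1)


if __name__ == "__main__":
    # Peptide lengths in the range reported in this thread (7-20 AA).
    for seq_len in (7, 10, 14, 20):
        contiguous = window_count(seq_len, span=6)   # contiguous 6-mer
        spaced = window_count(seq_len, span=10)      # assumed spaced-seed span
        print(f"length {seq_len:2d}: contiguous k=6 -> {contiguous:2d} positions, "
              f"spaced span 10 -> {spaced:2d} positions")
```

Under these assumptions a 7-residue peptide offers only two positions for a contiguous 6-mer and none for a seed spanning 10 residues, which is one reason a shorter k-mer and a shorter spaced pattern are worth trying.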

s-devos commented 3 years ago

I have tried that; even with the minimum k-mer size of 5 and either the default or various other spacing patterns, I run into problems such as:

Connected Component clustering does not cluster together any sequences shorter than 14 AA (which make up 80% of my data);
The cluster-reassign option instantly kills any cascaded clustering mode;
Identical short sequences are not clustered together, regardless of sequence identity thresholds or similarity types.
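One factor that may contribute with very short sequences, independent of the k-mer settings, is the coverage threshold. The logs in this thread use `-c 0.8 --cov-mode 0`. Assuming coverage mode 0 requires the alignment to cover that fraction of both sequences, a simple length check shows which pairs can never pass. This is a sketch of that necessary condition only; `min_aligned` and `can_pass_coverage` are hypothetical names, and the check does not explain why identical sequences would fail.

```python
def min_aligned(length: int, cov_percent: int = 80) -> int:
    """Smallest number of residues that covers cov_percent of a sequence
    (integer ceiling, to avoid floating-point rounding)."""
    return (cov_percent * length + 99) // 100


def can_pass_coverage(len_a: int, len_b: int, cov_percent: int = 80) -> bool:
    """Necessary condition under bidirectional coverage: the shorter sequence
    must be long enough to cover cov_percent of the longer one."""
    shorter, longer = sorted((len_a, len_b))
    return shorter >= min_aligned(longer, cov_percent)


if __name__ == "__main__":
    for a, b in [(7, 7), (7, 8), (7, 9), (8, 10), (14, 20)]:
        print(f"lengths {a:2d} and {b:2d}: can pass coverage = {can_pass_coverage(a, b)}")
```

With 7-20 AA peptides, a difference of only two residues can already rule a pair out under this assumption (for example lengths 7 and 9), so it may be worth checking whether the coverage settings fit the data.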

martin-steinegger commented 3 years ago

@s-devos --reassign-cluster does not make sense for connected component clustering. Connected component clusters contain many transitive members, and this mode will remove them again.
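To illustrate the point about transitive members, here is a toy model, not MMseqs2's implementation. It builds connected components with a union-find structure and then applies a stand-in for the reassignment check, namely "does the member have a direct link to its representative?". The function names, the choice of the smallest index as representative, and the example edges are all assumptions made for illustration.

```python
def connected_components(n, edges):
    """Group n items into connected components; the smallest index in each
    component serves as its representative (an illustrative choice)."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return clusters


def split_by_direct_link(clusters, edges):
    """For each cluster, separate members linked directly to the
    representative from members that joined only transitively."""
    direct = {frozenset(e) for e in edges}
    result = {}
    for rep, members in clusters.items():
        kept = [m for m in members if m == rep or frozenset((rep, m)) in direct]
        dropped = [m for m in members if m not in kept]
        result[rep] = (kept, dropped)
    return result


if __name__ == "__main__":
    # A chain 0-1-2-3: every item is only similar to its neighbour.
    edges = [(0, 1), (1, 2), (2, 3)]
    clusters = connected_components(4, edges)
    for rep, (kept, dropped) in split_by_direct_link(clusters, edges).items():
        print(f"representative {rep}: kept {kept}, would be removed {dropped}")
```

In this chain all four items end up in one component, but only item 1 is directly linked to the representative, so a per-member check against the representative would strip out exactly the members that connected component clustering was meant to collect.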

s-devos commented 3 years ago

@martin-steinegger That makes sense, although it also happens with Set Cover (it's --cluster-reassign, right? --reassign-cluster is not recognized):

Create directory tmp/ cluster ../DB_in/fasta_in_db_HC DB_clu tmp/ --cov-mode 0 --cluster-mode 0 --cluster-reassign 1

MMseqs Version: 0828d86539a4b6d7f64bc369a5b29920862afc5a Substitution matrix nucl:nucleotide.out,aa:blosum62.out Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out Sensitivity 4 k-mer length 0 k-score 2147483647 Alphabet size nucl:5,aa:21 Max sequence length 65535 Max results per query 20 Split database 0 Split mode 2 Split memory limit 0 Coverage threshold 0.8 Coverage mode 0 Compositional bias 1 Diagonal scoring true Exact k-mer matching 0 Mask residues 1 Mask lower case residues 0 Minimum diagonal score 15 Include identical seq. id. false Spaced k-mers 1 Preload mode 0 Pseudo count a 1 Pseudo count b 1.5 Spaced k-mer pattern Local temporary path Threads 16 Compressed 0 Verbosity 3 Add backtrace false Alignment mode 3 Allow wrapped scoring false E-value threshold 0.001 Seq. id. threshold 0 Min alignment length 0 Seq. id. mode 0 Alternative alignments 0 Max reject 2147483647 Max accept 2147483647 Score bias 0 Realign hits false Realign score bias -0.2 Realign max seqs 2147483647 Gap open cost nucl:5,aa:11 Gap extension cost nucl:2,aa:1 Zdrop 40 Rescore mode 0 Remove hits by seq. id. and coverage false Sort results 0 Cluster mode 0 Max connected component depth 1000 Similarity type 2 Single step clustering false Cascaded clustering steps 3 Cluster reassign true Remove temporary files false Force restart with latest tmp false MPI runner k-mers per sequence 21 Scale k-mers per sequence nucl:0.200,aa:0.000 Adjust k-mer length false Shift hash 67 Include only extendable false Skip repeating k-mers false

Set cluster sensitivity to -s 6.000000 Set cluster iterations to 3 linclust ../DB_in/fasta_in_db_HC tmp//538598962955004214/clu_redundancy tmp//538598962955004214/linclust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --alph-size nucl:5,aa:13 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 -k 0 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 0 --force-reuse 0

kmermatcher ../DB_in/fasta_in_db_HC tmp//538598962955004214/linclust/16628284907041385511/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 16 --compressed 0 -v 3

Database size: 303 type: Aminoacid Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)

Generate k-mers list for 1 split [=================================================================] 100.00% 303 0s 29ms Sort kmer 0h 0m 0s 2ms Sort by rep. sequence 0h 0m 0s 0ms Time for fill: 0h 0m 0s 0ms Time for merging to pref: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 160ms rescorediagonal ../DB_in/fasta_in_db_HC ../DB_in/fasta_in_db_HC tmp//538598962955004214/linclust/16628284907041385511/pref tmp//538598962955004214/linclust/16628284907041385511/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 303 0s 15ms Time for merging to pref_rescore1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 47ms clust ../DB_in/fasta_in_db_HC tmp//538598962955004214/linclust/16628284907041385511/pref_rescore1 tmp//538598962955004214/linclust/16628284907041385511/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 303 0s 5ms Sort entries Find missing connections Found 28 new connections. Reconstruct initial order [=================================================================] 100.00% 303 0s 14ms Add missing connections [=================================================================] 100.00% 303 0s 1ms

Time for read in: 0h 0m 0s 74ms Total time: 0h 0m 0s 108ms

Size of the sequence database: 303 Size of the alignment database: 303 Number of clusters: 276

Writing results 0h 0m 0s 0ms Time for merging to pre_clust: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 113ms createsubdb tmp//538598962955004214/linclust/16628284907041385511/order_redundancy ../DB_in/fasta_in_db_HC tmp//538598962955004214/linclust/16628284907041385511/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms createsubdb tmp//538598962955004214/linclust/16628284907041385511/order_redundancy tmp//538598962955004214/linclust/16628284907041385511/pref tmp//538598962955004214/linclust/16628284907041385511/pref_filter1 -v 3 --subdb-mode 1

Time for merging to pref_filter1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms filterdb tmp//538598962955004214/linclust/16628284907041385511/pref_filter1 tmp//538598962955004214/linclust/16628284907041385511/pref_filter2 --filter-file tmp//538598962955004214/linclust/16628284907041385511/order_redundancy --threads 16 --compressed 0 -v 3

Filtering using file(s) [=================================================================] 100.00% 276 0s 15ms Time for merging to pref_filter2: 0h 0m 0s 1ms Time for processing: 0h 0m 0s 46ms rescorediagonal tmp//538598962955004214/linclust/16628284907041385511/input_step_redundancy tmp//538598962955004214/linclust/16628284907041385511/input_step_redundancy tmp//538598962955004214/linclust/16628284907041385511/pref_filter2 tmp//538598962955004214/linclust/16628284907041385511/pref_rescore2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 16 --compressed 0 -v 3

Can not find any score per column for coverage 0.800000 and sequence identity 0.000000. No hit will be filtered. [=================================================================] 100.00% 276 0s 20ms Time for merging to pref_rescore2: 0h 0m 0s 11ms ] 54.55% 151 eta 0s Time for processing: 0h 0m 0s 54ms align tmp//538598962955004214/linclust/16628284907041385511/input_step_redundancy tmp//538598962955004214/linclust/16628284907041385511/input_step_redundancy tmp//538598962955004214/linclust/16628284907041385511/pref_rescore2 tmp//538598962955004214/linclust/16628284907041385511/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 276 type: Aminoacid Target database size: 276 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 276 0s 97ms Time for merging to aln: 0h 0m 0s 9ms 276 alignments calculated 276 sequence pairs passed the thresholds (1.000000 of overall calculated) 1.000000 hits per query sequence Time for processing: 0h 0m 0s 117ms clust tmp//538598962955004214/linclust/16628284907041385511/input_step_redundancy tmp//538598962955004214/linclust/16628284907041385511/aln tmp//538598962955004214/linclust/16628284907041385511/clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 276 0s 9ms Sort entries Find missing connections Found 0 new connections. Reconstruct initial order [=================================================================] 100.00% 276 0s 8ms Add missing connections [=================================================================] 100.00% 276 0s 1ms

Time for read in: 0h 0m 0s 83ms Total time: 0h 0m 0s 104ms

Size of the sequence database: 276 Size of the alignment database: 276 Number of clusters: 276

Writing results 0h 0m 0s 0ms Time for merging to clust: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 113ms mergeclusters ../DB_in/fasta_in_db_HC tmp//538598962955004214/clu_redundancy tmp//538598962955004214/linclust/16628284907041385511/pre_clust tmp//538598962955004214/linclust/16628284907041385511/clust --threads 16 --compressed 0 -v 3

Clustering step 1 [=================================================================] 100.00% 276 0s 22ms Clustering step 2 [=================================================================] 100.00% 276 0s 54ms Write merged clustering [=================================================================] 100.00% 303 0s 67ms Time for merging to clu_redundancy: 0h 0m 0s 5ms Time for processing: 0h 0m 0s 121ms createsubdb tmp//538598962955004214/clu_redundancy ../DB_in/fasta_in_db_HC tmp//538598962955004214/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms prefilter tmp//538598962955004214/input_step_redundancy tmp//538598962955004214/input_step_redundancy tmp//538598962955004214/pref_step0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 1 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 0 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 0 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 276 type: Aminoacid Estimated memory consumption: 978M Target database size: 276 type: Aminoacid Index table k-mer threshold: 154 at k-mer size 6 Index table: counting k-mers [=================================================================] 100.00% 276 0s 26ms Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 276 0s 6ms Index statistics Entries: 1187 DB size: 488 MB Avg k-mer size: 0.000019 Top 10 k-mers XXXXXX 7 XXXXXX 4 XXXXXX 4 XXXXXX 4 XXXXXX 4 XXXXXX 3 XXXXXX 3 XXXXXX 3 XXXXXX 3 XXXXXX 3 Time for index table init: 0h 0m 1s 58ms Process prefiltering step 1 of 1

k-mer similarity threshold: 154 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 276 Target db start 1 to 276 [=================================================================] 100.00% 276 0s 31ms

1.374916 k-mers per position 5 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 1 sequences passed prefiltering per query sequence 1 median result list length 0 sequences with 0 size result lists Time for merging to pref_step0: 0h 0m 0s 2ms Time for processing: 0h 0m 1s 665ms align tmp//538598962955004214/input_step_redundancy tmp//538598962955004214/input_step_redundancy tmp//538598962955004214/pref_step0 tmp//538598962955004214/aln_step0 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 276 type: Aminoacid Target database size: 276 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 276 0s 77ms Time for merging to aln_step0: 0h 0m 0s 9ms 413 alignments calculated 406 sequence pairs passed the thresholds (0.983051 of overall calculated) 1.471014 hits per query sequence Time for processing: 0h 0m 0s 121ms clust tmp//538598962955004214/input_step_redundancy tmp//538598962955004214/aln_step0 tmp//538598962955004214/clu_step0 --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 276 0s 12ms Sort entries Find missing connections Found 0 new connections. Reconstruct initial order [=================================================================] 100.00% 276 0s 9ms Add missing connections [=================================================================] 100.00% 276 0s 4ms

Time for read in: 0h 0m 0s 76ms Total time: 0h 0m 0s 99ms

Size of the sequence database: 276 Size of the alignment database: 276 Number of clusters: 239

Writing results 0h 0m 0s 0ms Time for merging to clu_step0: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 111ms createsubdb tmp//538598962955004214/clu_step0 tmp//538598962955004214/input_step_redundancy tmp//538598962955004214/input_step1 -v 3 --subdb-mode 1

Time for merging to input_step1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms prefilter tmp//538598962955004214/input_step1 tmp//538598962955004214/input_step1 tmp//538598962955004214/pref_step1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 3.5 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 239 type: Aminoacid Estimated memory consumption: 977M Target database size: 239 type: Aminoacid Index table k-mer threshold: 131 at k-mer size 6 Index table: counting k-mers [=================================================================] 100.00% 239 0s 13ms Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 239 0s 2ms Index statistics Entries: 1414 DB size: 488 MB Avg k-mer size: 0.000022 Top 10 k-mers XXXXXX 3 XXXXXX 3 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 Time for index table init: 0h 0m 1s 51ms Process prefiltering step 1 of 1

k-mer similarity threshold: 131 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 239 Target db start 1 to 239 [=================================================================] 100.00% 239 0s 23ms [============================================================> ] 92.44% 221 eta 0s 20.598031 k-mers per position 6 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 1 sequences passed prefiltering per query sequence 1 median result list length 0 sequences with 0 size result lists Time for merging to pref_step1: 0h 0m 0s 2ms Time for processing: 0h 0m 1s 760ms align tmp//538598962955004214/input_step1 tmp//538598962955004214/input_step1 tmp//538598962955004214/pref_step1 tmp//538598962955004214/aln_step1 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 239 type: Aminoacid Target database size: 239 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 239 0s 34ms Time for merging to aln_step1: 0h 0m 0s 6ms 308 alignments calculated 274 sequence pairs passed the thresholds (0.889610 of overall calculated) 1.146443 hits per query sequence Time for processing: 0h 0m 0s 70ms clust tmp//538598962955004214/input_step1 tmp//538598962955004214/aln_step1 tmp//538598962955004214/clu_step1 --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 239 0s 4ms Sort entries Find missing connections Found 9 new connections. Reconstruct initial order [=================================================================] 100.00% 239 0s 9ms Add missing connections [=================================================================] 100.00% 239 0s 0ms

Time for read in: 0h 0m 0s 74ms Total time: 0h 0m 0s 89ms

Size of the sequence database: 239 Size of the alignment database: 239 Number of clusters: 221

Writing results 0h 0m 0s 0ms Time for merging to clu_step1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 92ms createsubdb tmp//538598962955004214/clu_step1 tmp//538598962955004214/input_step1 tmp//538598962955004214/input_step2 -v 3 --subdb-mode 1

Time for merging to input_step2: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms prefilter tmp//538598962955004214/input_step2 tmp//538598962955004214/input_step2 tmp//538598962955004214/pref_step2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 6 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 221 type: Aminoacid Estimated memory consumption: 977M Target database size: 221 type: Aminoacid Index table k-mer threshold: 109 at k-mer size 6 Index table: counting k-mers [=================================================================] 100.00% 221 0s 29ms Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 221 0s 17ms Index statistics Entries: 1337 DB size: 488 MB Avg k-mer size: 0.000021 Top 10 k-mers XXXXXX 3 XXXXXX 3 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 Time for index table init: 0h 0m 0s 985ms Process prefiltering step 1 of 1

k-mer similarity threshold: 109 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 221 Target db start 1 to 221 [=================================================================] 100.00% 221 0s 31ms

190.836300 k-mers per position 8 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 1 sequences passed prefiltering per query sequence 1 median result list length 0 sequences with 0 size result lists Time for merging to pref_step2: 0h 0m 0s 4ms Time for processing: 0h 0m 1s 644ms align tmp//538598962955004214/input_step2 tmp//538598962955004214/input_step2 tmp//538598962955004214/pref_step2 tmp//538598962955004214/aln_step2 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 221 type: Aminoacid Target database size: 221 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 221 0s 47ms Time for merging to aln_step2: 0h 0m 0s 5ms 364 alignments calculated 254 sequence pairs passed the thresholds (0.697802 of overall calculated) 1.149321 hits per query sequence Time for processing: 0h 0m 0s 94ms clust tmp//538598962955004214/input_step2 tmp//538598962955004214/aln_step2 tmp//538598962955004214/clu_step2 --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 221 0s 11ms Sort entries Find missing connections Found 7 new connections. Reconstruct initial order [=================================================================] 100.00% 221 0s 4ms Add missing connections [=================================================================] 100.00% 221 0s 4ms

Time for read in: 0h 0m 0s 68ms Total time: 0h 0m 0s 93ms

Size of the sequence database: 221 Size of the alignment database: 221 Number of clusters: 201

Writing results 0h 0m 0s 1ms Time for merging to clu_step2: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 97ms mergeclusters ../DB_in/fasta_in_db_HC tmp//538598962955004214/clu tmp//538598962955004214/clu_redundancy tmp//538598962955004214/clu_step0 tmp//538598962955004214/clu_step1 tmp//538598962955004214/clu_step2

Clustering step 1 [=================================================================] 100.00% 276 0s 4ms Clustering step 2 [=================================================================] 100.00% 239 0s 23ms Clustering step 3 [=================================================================] 100.00% 221 0s 37ms Clustering step 4 [=================================================================] 100.00% 201 0s 68ms Write merged clustering [=================================================================] 100.00% 303 0s 91ms Time for merging to clu: 0h 0m 0s 11ms Time for processing: 0h 0m 0s 109ms align ../DB_in/fasta_in_db_HC ../DB_in/fasta_in_db_HC tmp//538598962955004214/clu tmp//538598962955004214/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 303 type: Aminoacid Target database size: 303 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 201 0s 52ms Time for merging to aln: 0h 0m 0s 1ms 303 alignments calculated 294 sequence pairs passed the thresholds (0.970297 of overall calculated) 1.462687 hits per query sequence Time for processing: 0h 0m 0s 122ms subtractdbs tmp//538598962955004214/clu tmp//538598962955004214/aln tmp//538598962955004214/clu_not_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

subtractdbs tmp//538598962955004214/clu tmp//538598962955004214/aln tmp//538598962955004214/clu_not_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

Remove tmp//538598962955004214/aln ids from tmp//538598962955004214/clu [=================================================================] 100.00% 201 0s 22ms Time for merging to clu_not_accepted: 0h 0m 0s 5ms Time for processing: 0h 0m 0s 40ms subtractdbs tmp//538598962955004214/clu tmp//538598962955004214/clu_not_accepted tmp//538598962955004214/clu_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

subtractdbs tmp//538598962955004214/clu tmp//538598962955004214/clu_not_accepted tmp//538598962955004214/clu_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

Remove tmp//538598962955004214/clu_not_accepted ids from tmp//538598962955004214/clu [=================================================================] 100.00% 201 0s 12ms Time for merging to clu_accepted: 0h 0m 0s 6ms Time for processing: 0h 0m 0s 25ms swapdb tmp//538598962955004214/clu_not_accepted tmp//538598962955004214/clu_not_accepted_swap --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 201 0s 14ms Computing offsets. [=================================================================] 100.00% 201 0s 4ms

Reading results. [=================================================================] 100.00% 201 0s 7ms

Output database: tmp//538598962955004214/clu_not_accepted_swap [=================================================================] 100.00% 284 0s 5ms

Time for merging to clu_not_accepted_swap: 0h 0m 0s 2ms Time for processing: 0h 0m 0s 66ms createsubdb tmp//538598962955004214/clu_not_accepted_swap ../DB_in/fasta_in_db_HC tmp//538598962955004214/seq_wrong_assigned -v 3

Time for merging to seq_wrong_assigned: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms createsubdb tmp//538598962955004214/clu ../DB_in/fasta_in_db_HC tmp//538598962955004214/seq_seeds -v 3

Time for merging to seq_seeds: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms prefilter tmp//538598962955004214/seq_wrong_assigned tmp//538598962955004214/seq_seeds.merged tmp//538598962955004214/seq_wrong_assigned_pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 6 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 9 type: Aminoacid Estimated memory consumption: 977M Target database size: 210 type: Aminoacid Index table k-mer threshold: 109 at k-mer size 6 Index table: counting k-mers [=================================================================] 100.00% 210 0s 27ms Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 210 0s 8ms Index statistics Entries: 1285 DB size: 488 MB Avg k-mer size: 0.000020 Top 10 k-mers XXXXXX 3 XXXXXX 3 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 Time for index table init: 0h 0m 1s 174ms Process prefiltering step 1 of 1

k-mer similarity threshold: 109 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 9 Target db start 1 to 210 [=================================================================] 100.00% 9 0s 11ms

460.860859 k-mers per position 14 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 3 sequences passed prefiltering per query sequence 3 median result list length 0 sequences with 0 size result lists Time for merging to seq_wrong_assigned_pref: 0h 0m 0s 1ms Time for processing: 0h 0m 1s 801ms swapdb tmp//538598962955004214/seq_wrong_assigned_pref tmp//538598962955004214/seq_wrong_assigned_pref_swaped --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 9 0s 13ms Computing offsets. [=================================================================] 100.00% 9 0s 5ms

Reading results. [=================================================================] 100.00% 9 0s 4ms

Output database: tmp//538598962955004214/seq_wrong_assigned_pref_swaped [=================================================================] 100.00% 297 0s 6ms

Time for merging to seq_wrong_assigned_pref_swaped: 0h 0m 0s 2ms Time for processing: 0h 0m 0s 85ms align tmp//538598962955004214/seq_seeds.merged tmp//538598962955004214/seq_wrong_assigned tmp//538598962955004214/seq_wrong_assigned_pref_swaped tmp//538598962955004214/seq_wrong_assigned_pref_swaped_aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 210 type: Aminoacid Target database size: 9 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 18 0s 12ms Time for merging to seq_wrong_assigned_pref_swaped_aln: 0h 0m 0s 0ms 24 alignments calculated 18 sequence pairs passed the thresholds (0.750000 of overall calculated) 1.000000 hits per query sequence Time for processing: 0h 0m 0s 63ms filterdb tmp//538598962955004214/seq_wrong_assigned_pref_swaped_aln tmp//538598962955004214/seq_wrong_assigned_pref_swaped_aln_ocol --trim-to-one-column --threads 16 --compressed 0 -v 3

Filtering using regular expression [=================================================================] 100.00% 18 0s 13ms Time for merging to seq_wrong_assigned_pref_swaped_aln_ocol: 0h 0m 0s 5ms Time for processing: 0h 0m 0s 61ms mergedbs tmp//538598962955004214/seq_seeds.merged tmp//538598962955004214/clu_accepted_plus_wrong tmp//538598962955004214/clu_accepted tmp//538598962955004214/seq_wrong_assigned_pref_swaped_aln_ocol --merge-stop-empty 0 --compressed 0 -v 3

Merging the results to tmp//538598962955004214/clu_accepted_plus_wrong [=================================================================] 100.00% 210 0s 1ms Time for merging to clu_accepted_plus_wrong: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 3ms tsv2db tmp//538598962955004214/missing.single.seqs tmp//538598962955004214/missing.single.seqs.db --output-dbtype 6 --compressed 0 -v 3

Output database type: Clustering Time for merging to missing.single.seqs.db: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms mergedbs ../DB_in/fasta_in_db_HC tmp//538598962955004214/clu_accepted_plus_wrong_plus_single tmp//538598962955004214/clu_accepted_plus_wrong tmp//538598962955004214/missing.single.seqs.db --merge-stop-empty 0 --compressed 0 -v 3

Merging the results to tmp//538598962955004214/clu_accepted_plus_wrong_plus_single [=================================================================] 100.00% 303 0s 2ms Time for merging to clu_accepted_plus_wrong_plus_single: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 4ms clust ../DB_in/fasta_in_db_HC tmp//538598962955004214/clu_accepted_plus_wrong_plus_single DB_clu --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 303 0s 10ms Sort entries Find missing connections Found 98 new connections. Reconstruct initial order Alignment format is not supported! ] 0.00% 1 eta - Alignment format is not supported! Alignment format is not supported! Alignment format is not supported! Error: Clustering step 2 died

milot-mirdita commented 3 years ago

Can you upload your sequences (or a subset that also has this issue) somewhere? We need to reproduce the issue somehow.

milot-mirdita commented 3 years ago

This looks pretty bad:

Top 10 k-mers
XXXXXX 3
XXXXXX 3
XXXXXX 2
XXXXXX 2
XXXXXX 2
XXXXXX 2
XXXXXX 2
XXXXXX 2
XXXXXX 2
XXXXXX 2

Can you try using --mask 0? Also add -k 5 --spaced-kmer-pattern 110111. This should allow matching sequences that are 7 residues long.

s-devos commented 3 years ago

My sequences are proprietary, but I'll try to reproduce it with an artificial short sequence set

@milot-mirdita I have censored that myself!

s-devos commented 3 years ago

with --mask 0:

mmseqs cluster ../DB_in/fasta_in_db_HC DB_clu tmp/ --mask 0 --cov-mode 0 --cluster-mode 0 --cluster-reassign 1

Alignment format is not supported! ] 0.00% 1 eta - Alignment format is not supported! Error: Clustering step 2 died

s-devos commented 3 years ago

@milot-mirdita things are not getting much more logical:

mmseqs cluster ../DB_in/fasta_in_db_HC DB_clu tmp/ --mask 0 -k 5 --spaced-kmer-pattern 110111 --cov-mode 0 --cluster-mode 0 --cluster-reassign 1

output: User-specified k-mer pattern is not consistent with stated k-mer size
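For reference, the suggested pattern can be checked by hand. The sketch below assumes that "consistent" means the number of 1s in the spaced pattern equals the value passed to -k; that rule is an assumption and is not taken from the MMseqs2 source. The helper name pattern_weight is hypothetical.

```shell
# Count the "1" positions in a spaced k-mer pattern.
# Assumption: the pattern contains only the characters 0 and 1.
pattern_weight() {
    # Delete every 0, count what is left, and strip the padding some wc versions add.
    printf '%s' "$1" | tr -d '0' | wc -c | tr -d ' '
}

pattern=110111
k=5
if [ "$(pattern_weight "$pattern")" -eq "$k" ]; then
    echo "pattern $pattern has $k ones, matching -k $k"
else
    echo "pattern $pattern does not have $k ones"
fi
```

Under that assumption, 110111 and -k 5 agree with each other, so the rejection is not explained by the values themselves.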

milot-mirdita commented 3 years ago

Did you happen to create an index for this fasta_in_db_HC (i.e. with createindex)? This shouldn't be happening.
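A quick way to answer this from the shell is to look for index files next to the database. The sketch below assumes that createindex stores its result under the database prefix with an .idx suffix; that file layout is an assumption and may differ between MMseqs2 versions. The helper name has_mmseqs_index is hypothetical.

```shell
# Report whether a precomputed index appears to exist for a database prefix.
# Assumption: the index, if present, is stored at "<prefix>.idx".
has_mmseqs_index() {
    [ -e "$1.idx" ]
}

if has_mmseqs_index DB_in/fasta_in_db_HC; then
    echo "index found"
else
    echo "no index found"
fi
```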

s-devos commented 3 years ago

@milot-mirdita I just created a db using createdb with a sensible fasta file

s-devos commented 3 years ago

mmseqs createdb short_seqs.fasta DB_in/fasta_in
mmseqs cluster DB_in/fasta_in DB_clu tmp/ -k 5 --spaced-kmer-pattern 110111

gives:

User-specified k-mer pattern is not consistent with stated k-mer size
Error: kmermatcher died
Error: linclust died

s-devos commented 3 years ago

Artificial fasta with short (10 AA) sequences:

artificial.txt
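The attachment itself is not reproduced here. A comparable set of short random sequences could be generated roughly as follows; the function name, the record names (seq1, seq2, ...), the sequence count, and the seed are all hypothetical and are not taken from the attached file.

```shell
# Print N random amino-acid sequences of a fixed length in FASTA format.
# $1 = number of sequences, $2 = sequence length, $3 = random seed
make_artificial_fasta() {
    awk -v n="$1" -v len="$2" -v seed="$3" 'BEGIN {
        srand(seed)
        aa = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard residues
        for (i = 1; i <= n; i++) {
            seq = ""
            for (j = 1; j <= len; j++)
                seq = seq substr(aa, int(rand() * 20) + 1, 1)
            printf ">seq%d\n%s\n", i, seq
        }
    }'
}

# Show three 10-residue records.
make_artificial_fasta 3 10 42
```

Redirecting the output to a file gives an input for createdb. Uniformly random sequences share very little similarity with each other, so such a set may not exercise the reassignment step the way real data does.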

commands:

mmseqs createdb artificial.fasta DB_artificial/artificial_DB
mmseqs cluster DB_artificial/artificial_DB db_clu tmp/ -k 5 --spaced-kmer-pattern 110111

output ends with:

User-specified k-mer pattern is not consistent with stated k-mer size
User-specified k-mer pattern is not consistent with stated k-mer size
Error: kmermatcher died

s-devos commented 3 years ago

and for the original cluster-reassign problem:

mmseqs cluster DB_in/artificial_DB DB_clu/set_cover_reassign set_cover_reassign_tmp --cluster-mode 0 --cov-mode 0 --cluster-reassign 1 --mask 0

gives:

swapdb set_cover_reassign_tmp/17639961554283803127/seq_wrong_assigned_pref set_cover_reassign_tmp/17639961554283803127/seq_wrong_assigned_pref_swaped --threads 16 --compressed 0 -v 3

Input set_cover_reassign_tmp/17639961554283803127/seq_wrong_assigned_pref does not exist
Error: swapdb2 reassign died

milot-mirdita commented 3 years ago

I can reproduce both of these issues; I'll hopefully fix them later today.

Still quite puzzled about where the error "Alignment format is not supported!" came from, though.

s-devos commented 3 years ago

Very glad to hear it is not me!

Still quite puzzled about where the error "Alignment format is not supported!" came from, though.

It may or may not be a coincidence that the same error was mentioned in #329, also in the context of --cluster-reassign

milot-mirdita commented 3 years ago

I pushed a fix for the two issues I can reproduce. Can you try again?

milot-mirdita commented 3 years ago

I had messed up the push; you were a bit faster than I thought. The right commit is 9290a2b529da9763992bde2e6e0f95c61b003123

s-devos commented 3 years ago

@milot-mirdita Your fix works for the artificial set, but unfortunately not for my own set. I guess the artificial set has very little similarity between sequences, which would explain the difference in behaviour:

~ See newest comment below for log with reproducible data ~

mmseqs cluster DB_in DB_clu tmp/ --cluster-reassign 1 --cluster-mode 0 --cov-mode 0

Create directory tmp/ cluster DB_in DB_clu tmp/ --cluster-reassign 1 --cluster-mode 0 --cov-mode 0

MMseqs Version: 9290a2b529da9763992bde2e6e0f95c61b003123 Substitution matrix nucl:nucleotide.out,aa:blosum62.out Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out Sensitivity 4 k-mer length 0 k-score 2147483647 Alphabet size nucl:5,aa:21 Max sequence length 65535 Max results per query 20 Split database 0 Split mode 2 Split memory limit 0 Coverage threshold 0.8 Coverage mode 0 Compositional bias 1 Diagonal scoring true Exact k-mer matching 0 Mask residues 1 Mask lower case residues 0 Minimum diagonal score 15 Include identical seq. id. false Spaced k-mers 1 Preload mode 0 Pseudo count a 1 Pseudo count b 1.5 Spaced k-mer pattern Local temporary path Threads 16 Compressed 0 Verbosity 3 Add backtrace false Alignment mode 3 Allow wrapped scoring false E-value threshold 0.001 Seq. id. threshold 0 Min alignment length 0 Seq. id. mode 0 Alternative alignments 0 Max reject 2147483647 Max accept 2147483647 Score bias 0 Realign hits false Realign score bias -0.2 Realign max seqs 2147483647 Gap open cost nucl:5,aa:11 Gap extension cost nucl:2,aa:1 Zdrop 40 Rescore mode 0 Remove hits by seq. id. and coverage false Sort results 0 Cluster mode 0 Max connected component depth 1000 Similarity type 2 Single step clustering false Cascaded clustering steps 3 Cluster reassign true Remove temporary files false Force restart with latest tmp false MPI runner k-mers per sequence 21 Scale k-mers per sequence nucl:0.200,aa:0.000 Adjust k-mer length false Shift hash 67 Include only extendable false Skip repeating k-mers false

Set cluster sensitivity to -s 6.000000 Set cluster iterations to 3 linclust DB_in tmp//10798751672030653963/clu_redundancy tmp//10798751672030653963/linclust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --alph-size nucl:5,aa:13 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 -k 0 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 0 --force-reuse 0

kmermatcher DB_in tmp//10798751672030653963/linclust/5052420726377277994/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 16 --compressed 0 -v 3

Database size: 303 type: Aminoacid Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)

Generate k-mers list for 1 split [=================================================================] 100.00% 303 0s 20ms Sort kmer 0h 0m 0s 0ms Sort by rep. sequence 0h 0m 0s 0ms Time for fill: 0h 0m 0s 0ms Time for merging to pref: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 92ms rescorediagonal DB_in DB_in tmp//10798751672030653963/linclust/5052420726377277994/pref tmp//10798751672030653963/linclust/5052420726377277994/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 303 0s 28ms Time for merging to pref_rescore1: 0h 0m 0s 3ms===============> ] 94.37% 286 eta 0s Time for processing: 0h 0m 0s 69ms clust DB_in tmp//10798751672030653963/linclust/5052420726377277994/pref_rescore1 tmp//10798751672030653963/linclust/5052420726377277994/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 303 0s 10ms Sort entries Find missing connections Found 29 new connections. Reconstruct initial order [=================================================================] 100.00% 303 0s 9ms Add missing connections [=================================================================] 100.00% 303 0s 1ms

Time for read in: 0h 0m 0s 96ms Total time: 0h 0m 0s 125ms

Size of the sequence database: 303 Size of the alignment database: 303 Number of clusters: 276

Writing results 0h 0m 0s 0ms Time for merging to pre_clust: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 128ms createsubdb tmp//10798751672030653963/linclust/5052420726377277994/order_redundancy DB_in tmp//10798751672030653963/linclust/5052420726377277994/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms createsubdb tmp//10798751672030653963/linclust/5052420726377277994/order_redundancy tmp//10798751672030653963/linclust/5052420726377277994/pref tmp//10798751672030653963/linclust/5052420726377277994/pref_filter1 -v 3 --subdb-mode 1

Time for merging to pref_filter1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms filterdb tmp//10798751672030653963/linclust/5052420726377277994/pref_filter1 tmp//10798751672030653963/linclust/5052420726377277994/pref_filter2 --filter-file tmp//10798751672030653963/linclust/5052420726377277994/order_redundancy --threads 16 --compressed 0 -v 3

Filtering using file(s) [=================================================================] 100.00% 276 0s 5ms Time for merging to pref_filter2: 0h 0m 0s 1ms Time for processing: 0h 0m 0s 20ms rescorediagonal tmp//10798751672030653963/linclust/5052420726377277994/input_step_redundancy tmp//10798751672030653963/linclust/5052420726377277994/input_step_redundancy tmp//10798751672030653963/linclust/5052420726377277994/pref_filter2 tmp//10798751672030653963/linclust/5052420726377277994/pref_rescore2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 16 --compressed 0 -v 3

Can not find any score per column for coverage 0.800000 and sequence identity 0.000000. No hit will be filtered. [=================================================================] 100.00% 276 0s 8ms Time for merging to pref_rescore2: 0h 0m 0s 2ms ] 16.00% 45 eta 0s Time for processing: 0h 0m 0s 37ms align tmp//10798751672030653963/linclust/5052420726377277994/input_step_redundancy tmp//10798751672030653963/linclust/5052420726377277994/input_step_redundancy tmp//10798751672030653963/linclust/5052420726377277994/pref_rescore2 tmp//10798751672030653963/linclust/5052420726377277994/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 276 type: Aminoacid Target database size: 276 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 276 0s 64ms Time for merging to aln: 0h 0m 0s 5ms 276 alignments calculated 276 sequence pairs passed the thresholds (1.000000 of overall calculated) 1.000000 hits per query sequence Time for processing: 0h 0m 0s 80ms clust tmp//10798751672030653963/linclust/5052420726377277994/input_step_redundancy tmp//10798751672030653963/linclust/5052420726377277994/aln tmp//10798751672030653963/linclust/5052420726377277994/clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 276 0s 8ms Sort entries Find missing connections Found 0 new connections. Reconstruct initial order [=================================================================] 100.00% 276 0s 4ms Add missing connections [=================================================================] 100.00% 276 0s 1ms

Time for read in: 0h 0m 0s 38ms Total time: 0h 0m 0s 48ms

Size of the sequence database: 276 Size of the alignment database: 276 Number of clusters: 276

Writing results 0h 0m 0s 0ms Time for merging to clust: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 51ms mergeclusters DB_in tmp//10798751672030653963/clu_redundancy tmp//10798751672030653963/linclust/5052420726377277994/pre_clust tmp//10798751672030653963/linclust/5052420726377277994/clust --threads 16 --compressed 0 -v 3

Clustering step 1 [=================================================================] 100.00% 276 0s 11ms Clustering step 2 [=================================================================] 100.00% 276 0s 36ms Write merged clustering [=================================================================] 100.00% 303 0s 53ms Time for merging to clu_redundancy: 0h 0m 0s 2ms Time for processing: 0h 0m 0s 67ms createsubdb tmp//10798751672030653963/clu_redundancy DB_in tmp//10798751672030653963/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms prefilter tmp//10798751672030653963/input_step_redundancy tmp//10798751672030653963/input_step_redundancy tmp//10798751672030653963/pref_step0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 1 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 0 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 0 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 276 type: Aminoacid Estimated memory consumption: 978M Target database size: 276 type: Aminoacid Index table k-mer threshold: 154 at k-mer size 6 Index table: counting k-mers [=================================================================] 100.00% 276 0s 30ms Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 276 0s 3ms Index statistics Entries: 1187 DB size: 488 MB Avg k-mer size: 0.000019 Top 10 k-mers XXXXXX 7 XXXXXX 4 XXXXXX 4 XXXXXX 4 XXXXXX 4 XXXXXX 3 XXXXXX 3 XXXXXX 3 XXXXXX 3 XXXXXX 3 Time for index table init: 0h 0m 0s 848ms Process prefiltering step 1 of 1

k-mer similarity threshold: 154 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 276 Target db start 1 to 276 [=================================================================] 100.00% 276 0s 19ms

1.394095 k-mers per position 5 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 1 sequences passed prefiltering per query sequence 1 median result list length 0 sequences with 0 size result lists Time for merging to pref_step0: 0h 0m 0s 3ms Time for processing: 0h 0m 1s 477ms align tmp//10798751672030653963/input_step_redundancy tmp//10798751672030653963/input_step_redundancy tmp//10798751672030653963/pref_step0 tmp//10798751672030653963/aln_step0 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 276 type: Aminoacid Target database size: 276 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 276 0s 44ms Time for merging to aln_step0: 0h 0m 0s 7ms 415 alignments calculated 408 sequence pairs passed the thresholds (0.983133 of overall calculated) 1.478261 hits per query sequence Time for processing: 0h 0m 0s 89ms clust tmp//10798751672030653963/input_step_redundancy tmp//10798751672030653963/aln_step0 tmp//10798751672030653963/clu_step0 --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 276 0s 11ms Sort entries Find missing connections Found 0 new connections. Reconstruct initial order [=================================================================] 100.00% 276 0s 13ms Add missing connections [=================================================================] 100.00% 276 0s 0ms

Time for read in: 0h 0m 0s 83ms Total time: 0h 0m 0s 118ms

Size of the sequence database: 276 Size of the alignment database: 276 Number of clusters: 239

Writing results 0h 0m 0s 0ms Time for merging to clu_step0: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 125ms createsubdb tmp//10798751672030653963/clu_step0 tmp//10798751672030653963/input_step_redundancy tmp//10798751672030653963/input_step1 -v 3 --subdb-mode 1

Time for merging to input_step1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 2ms prefilter tmp//10798751672030653963/input_step1 tmp//10798751672030653963/input_step1 tmp//10798751672030653963/pref_step1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 3.5 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 239 type: Aminoacid Estimated memory consumption: 977M Target database size: 239 type: Aminoacid Index table k-mer threshold: 131 at k-mer size 6 Index table: counting k-mers [=================================================================] 100.00% 239 0s 16ms Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 239 0s 6ms Index statistics Entries: 1415 DB size: 488 MB Avg k-mer size: 0.000022 Top 10 k-mers XXXXXX 3 XXXXXX 3 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 Time for index table init: 0h 0m 1s 50ms Process prefiltering step 1 of 1

k-mer similarity threshold: 131 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 239 Target db start 1 to 239 [=================================================================] 100.00% 239 0s 20ms [================================================================>] 99.58% 238 eta 0s 20.876247 k-mers per position 6 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 1 sequences passed prefiltering per query sequence 1 median result list length 0 sequences with 0 size result lists Time for merging to pref_step1: 0h 0m 0s 2ms Time for processing: 0h 0m 1s 824ms align tmp//10798751672030653963/input_step1 tmp//10798751672030653963/input_step1 tmp//10798751672030653963/pref_step1 tmp//10798751672030653963/aln_step1 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 239 type: Aminoacid Target database size: 239 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 239 0s 46ms Time for merging to aln_step1: 0h 0m 0s 11ms 308 alignments calculated 270 sequence pairs passed the thresholds (0.876623 of overall calculated) 1.129707 hits per query sequence Time for processing: 0h 0m 0s 78ms clust tmp//10798751672030653963/input_step1 tmp//10798751672030653963/aln_step1 tmp//10798751672030653963/clu_step1 --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 239 0s 12ms Sort entries Find missing connections Found 7 new connections. Reconstruct initial order [=================================================================] 100.00% 239 0s 3ms Add missing connections [=================================================================] 100.00% 239 0s 0ms

Time for read in: 0h 0m 0s 50ms Total time: 0h 0m 0s 71ms

Size of the sequence database: 239 Size of the alignment database: 239 Number of clusters: 222

Writing results 0h 0m 0s 0ms Time for merging to clu_step1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 73ms createsubdb tmp//10798751672030653963/clu_step1 tmp//10798751672030653963/input_step1 tmp//10798751672030653963/input_step2 -v 3 --subdb-mode 1

Time for merging to input_step2: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms prefilter tmp//10798751672030653963/input_step2 tmp//10798751672030653963/input_step2 tmp//10798751672030653963/pref_step2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 6 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 222 type: Aminoacid Estimated memory consumption: 977M Target database size: 222 type: Aminoacid Index table k-mer threshold: 109 at k-mer size 6 Index table: counting k-mers [=================================================================] 100.00% 222 0s 24ms Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 222 0s 3ms Index statistics Entries: 1342 DB size: 488 MB Avg k-mer size: 0.000021 Top 10 k-mers XXXXXX 3 XXXXXX 3 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 Time for index table init: 0h 0m 0s 970ms Process prefiltering step 1 of 1

k-mer similarity threshold: 109 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 222 Target db start 1 to 222 [=================================================================] 100.00% 222 0s 52ms

196.811469 k-mers per position 8 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 1 sequences passed prefiltering per query sequence 1 median result list length 0 sequences with 0 size result lists Time for merging to pref_step2: 0h 0m 0s 11ms Time for processing: 0h 0m 1s 704ms align tmp//10798751672030653963/input_step2 tmp//10798751672030653963/input_step2 tmp//10798751672030653963/pref_step2 tmp//10798751672030653963/aln_step2 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 222 type: Aminoacid Target database size: 222 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 222 0s 80ms Time for merging to aln_step2: 0h 0m 0s 11ms 388 alignments calculated 266 sequence pairs passed the thresholds (0.685567 of overall calculated) 1.198198 hits per query sequence Time for processing: 0h 0m 0s 106ms clust tmp//10798751672030653963/input_step2 tmp//10798751672030653963/aln_step2 tmp//10798751672030653963/clu_step2 --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 222 0s 2ms Sort entries Find missing connections Found 6 new connections. Reconstruct initial order [=================================================================] 100.00% 222 0s 5ms Add missing connections [=================================================================] 100.00% 222 0s 0ms

Time for read in: 0h 0m 0s 56ms Total time: 0h 0m 0s 62ms

Size of the sequence database: 222 Size of the alignment database: 222 Number of clusters: 200

Writing results 0h 0m 0s 10ms Time for merging to clu_step2: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 75ms mergeclusters DB_in tmp//10798751672030653963/clu tmp//10798751672030653963/clu_redundancy tmp//10798751672030653963/clu_step0 tmp//10798751672030653963/clu_step1 tmp//10798751672030653963/clu_step2

Clustering step 1 [=================================================================] 100.00% 276 0s 5ms Clustering step 2 [=================================================================] 100.00% 239 0s 25ms Clustering step 3 [=================================================================] 100.00% 222 0s 44ms Clustering step 4 [=================================================================] 100.00% 200 0s 61ms Write merged clustering [=================================================================] 100.00% 303 0s 73ms Time for merging to clu: 0h 0m 0s 10ms Time for processing: 0h 0m 0s 88ms align DB_in DB_in tmp//10798751672030653963/clu tmp//10798751672030653963/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 303 type: Aminoacid Target database size: 303 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 200 0s 20ms Time for merging to aln: 0h 0m 0s 1ms 303 alignments calculated 289 sequence pairs passed the thresholds (0.953795 of overall calculated) 1.445000 hits per query sequence Time for processing: 0h 0m 0s 47ms subtractdbs tmp//10798751672030653963/clu tmp//10798751672030653963/aln tmp//10798751672030653963/clu_not_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

subtractdbs tmp//10798751672030653963/clu tmp//10798751672030653963/aln tmp//10798751672030653963/clu_not_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

Remove tmp//10798751672030653963/aln ids from tmp//10798751672030653963/clu [=================================================================] 100.00% 200 0s 19ms Time for merging to clu_not_accepted: 0h 0m 0s 4ms Time for processing: 0h 0m 0s 32ms swapdb tmp//10798751672030653963/clu_not_accepted tmp//10798751672030653963/clu_not_accepted_swap --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 200 0s 10ms Computing offsets. [=================================================================] 100.00% 200 0s 5ms

Reading results. [=================================================================] 100.00% 200 0s 15ms

Output database: tmp//10798751672030653963/clu_not_accepted_swap [=================================================================] 100.00% 273 0s 7ms

Time for merging to clu_not_accepted_swap: 0h 0m 0s 1ms Time for processing: 0h 0m 0s 105ms subtractdbs tmp//10798751672030653963/clu tmp//10798751672030653963/clu_not_accepted tmp//10798751672030653963/clu_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

subtractdbs tmp//10798751672030653963/clu tmp//10798751672030653963/clu_not_accepted tmp//10798751672030653963/clu_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

Remove tmp//10798751672030653963/clu_not_accepted ids from tmp//10798751672030653963/clu [=================================================================] 100.00% 200 0s 22ms Time for merging to clu_accepted: 0h 0m 0s 4ms Time for processing: 0h 0m 0s 40ms createsubdb tmp//10798751672030653963/clu_not_accepted_swap DB_in tmp//10798751672030653963/seq_wrong_assigned -v 3

Time for merging to seq_wrong_assigned: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms createsubdb tmp//10798751672030653963/clu DB_in tmp//10798751672030653963/seq_seeds -v 3

Time for merging to seq_seeds: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 2ms prefilter tmp//10798751672030653963/seq_wrong_assigned tmp//10798751672030653963/seq_seeds.merged tmp//10798751672030653963/seq_wrong_assigned_pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 6 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 14 type: Aminoacid Estimated memory consumption: 977M Target database size: 214 type: Aminoacid Index table k-mer threshold: 109 at k-mer size 6 Index table: counting k-mers [=================================================================] 100.00% 214 0s 24ms Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 214 0s 9ms Index statistics Entries: 1312 DB size: 488 MB Avg k-mer size: 0.000021 Top 10 k-mers XXXXXX 3 XXXXXX 3 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 XXXXXX 2 Time for index table init: 0h 0m 0s 857ms Process prefiltering step 1 of 1

k-mer similarity threshold: 109 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 14 Target db start 1 to 214 [=================================================================] 100.00% 14 0s 45ms

376.012940 k-mers per position 14 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 3 sequences passed prefiltering per query sequence 3 median result list length 0 sequences with 0 size result lists Time for merging to seq_wrong_assigned_pref: 0h 0m 0s 6ms Time for processing: 0h 0m 1s 611ms swapdb tmp//10798751672030653963/seq_wrong_assigned_pref tmp//10798751672030653963/seq_wrong_assigned_pref_swaped --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 14 0s 4ms Computing offsets. [=================================================================] 100.00% 14 0s 2ms

Reading results. [=================================================================] 100.00% 14 0s 4ms

Output database: tmp//10798751672030653963/seq_wrong_assigned_pref_swaped [=================================================================] 100.00% 284 0s 6ms

Time for merging to seq_wrong_assigned_pref_swaped: 0h 0m 0s 1ms Time for processing: 0h 0m 0s 87ms align tmp//10798751672030653963/seq_seeds.merged tmp//10798751672030653963/seq_wrong_assigned tmp//10798751672030653963/seq_wrong_assigned_pref_swaped tmp//10798751672030653963/seq_wrong_assigned_pref_swaped_aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 214 type: Aminoacid Target database size: 14 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 29 0s 13ms Time for merging to seq_wrong_assigned_pref_swaped_aln: 0h 0m 0s 5ms 41 alignments calculated 32 sequence pairs passed the thresholds (0.780488 of overall calculated) 1.103448 hits per query sequence Time for processing: 0h 0m 0s 56ms filterdb tmp//10798751672030653963/seq_wrong_assigned_pref_swaped_aln tmp//10798751672030653963/seq_wrong_assigned_pref_swaped_aln_ocol --trim-to-one-column --threads 16 --compressed 0 -v 3

Filtering using regular expression [=================================================================] 100.00% 29 0s 8ms Time for merging to seq_wrong_assigned_pref_swaped_aln_ocol: 0h 0m 0s 3ms Time for processing: 0h 0m 0s 51ms mergedbs tmp//10798751672030653963/seq_seeds.merged tmp//10798751672030653963/clu_accepted_plus_wrong tmp//10798751672030653963/clu_accepted tmp//10798751672030653963/seq_wrong_assigned_pref_swaped_aln_ocol --merge-stop-empty 0 --compressed 0 -v 3

Merging the results to tmp//10798751672030653963/clu_accepted_plus_wrong [=================================================================] 100.00% 214 0s 1ms Time for merging to clu_accepted_plus_wrong: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 2ms tsv2db tmp//10798751672030653963/missing.single.seqs tmp//10798751672030653963/missing.single.seqs.db --output-dbtype 6 --compressed 0 -v 3

Output database type: Clustering Time for merging to missing.single.seqs.db: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms mergedbs DB_in tmp//10798751672030653963/clu_accepted_plus_wrong_plus_single tmp//10798751672030653963/clu_accepted_plus_wrong tmp//10798751672030653963/missing.single.seqs.db --merge-stop-empty 0 --compressed 0 -v 3

Merging the results to tmp//10798751672030653963/clu_accepted_plus_wrong_plus_single [=================================================================] 100.00% 303 0s 0ms Time for merging to clu_accepted_plus_wrong_plus_single: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms clust DB_in tmp//10798751672030653963/clu_accepted_plus_wrong_plus_single DB_clu --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 303 0s 8ms
Sort entries
Find missing connections
Found 99 new connections.
Reconstruct initial order
Alignment format is not supported! ] 0.00% 1 eta -
Alignment format is not supported!
Alignment format is not supported!
*** Error in Alignment format is not supported!
Segmentation fault (core dumped)
Error: Clustering step 2 died
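For triage, here is a minimal sketch of how one might find which module call was echoed last before the first "Alignment format is not supported!" message in a saved log. It is not part of MMseqs2. The function name `last_module_before_error` and the `MODULES` tuple are hypothetical helpers, and the module names are simply those that appear in the log above. It assumes each module invocation is echoed as the module name followed by its first two arguments, as in this log, and it only locates the failing step without explaining the cause.

```python
import re

# Module names copied from the log above (assumption: extend for other workflows).
MODULES = (
    "prefilter", "align", "clust", "mergedbs", "subtractdbs",
    "swapdb", "createsubdb", "filterdb", "tsv2db", "mergeclusters",
)

ERROR_TEXT = "Alignment format is not supported!"

# A module echo is a module name preceded by start-of-text or whitespace
# and followed by at least two whitespace-separated arguments.
_CALL = re.compile(
    r"(?:^|\s)(" + "|".join(re.escape(m) for m in MODULES) + r")\s+(\S+)\s+(\S+)"
)


def last_module_before_error(log_text, error_text=ERROR_TEXT):
    """Return (module, arg1, arg2) for the last echoed module call that
    precedes the first occurrence of error_text, or None if not found."""
    pos = log_text.find(error_text)
    if pos == -1:
        return None  # the error message does not occur in this log
    calls = list(_CALL.finditer(log_text[:pos]))
    if not calls:
        return None  # no recognizable module call before the error
    last = calls[-1]
    return last.group(1), last.group(2), last.group(3)


if __name__ == "__main__":
    import sys

    # Usage: python find_failing_step.py cluster.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        print(last_module_before_error(handle.read()))
```

On the excerpt above, the last call echoed before the error is the final `clust DB_in .../clu_accepted_plus_wrong_plus_single DB_clu`, which is the step that consumes the merged reassignment result.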

s-devos commented 3 years ago

New artificial set with a few highly similar sequences added (10 AA and 7 AA long): artificial2.txt

Commands:

mmseqs createdb artificial2.fasta artificial2/DB_artificial

mmseqs cluster artificial2/DB_artificial db_clu tmp/ --cluster-reassign 1 --cov-mode 0 --cluster-mode 0

Output:

Create directory tmp/ cluster artificial2/DB_artificial db_clu tmp/ --cluster-reassign 1 --cov-mode 0 --cluster-mode 0

MMseqs Version: 9290a2b529da9763992bde2e6e0f95c61b003123 Substitution matrix nucl:nucleotide.out,aa:blosum62.out Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out Sensitivity 4 k-mer length 0 k-score 2147483647 Alphabet size nucl:5,aa:21 Max sequence length 65535 Max results per query 20 Split database 0 Split mode 2 Split memory limit 0 Coverage threshold 0.8 Coverage mode 0 Compositional bias 1 Diagonal scoring true Exact k-mer matching 0 Mask residues 1 Mask lower case residues 0 Minimum diagonal score 15 Include identical seq. id. false Spaced k-mers 1 Preload mode 0 Pseudo count a 1 Pseudo count b 1.5 Spaced k-mer pattern Local temporary path Threads 16 Compressed 0 Verbosity 3 Add backtrace false Alignment mode 3 Allow wrapped scoring false E-value threshold 0.001 Seq. id. threshold 0 Min alignment length 0 Seq. id. mode 0 Alternative alignments 0 Max reject 2147483647 Max accept 2147483647 Score bias 0 Realign hits false Realign score bias -0.2 Realign max seqs 2147483647 Gap open cost nucl:5,aa:11 Gap extension cost nucl:2,aa:1 Zdrop 40 Rescore mode 0 Remove hits by seq. id. and coverage false Sort results 0 Cluster mode 0 Max connected component depth 1000 Similarity type 2 Single step clustering false Cascaded clustering steps 3 Cluster reassign true Remove temporary files false Force restart with latest tmp false MPI runner k-mers per sequence 21 Scale k-mers per sequence nucl:0.200,aa:0.000 Adjust k-mer length false Shift hash 67 Include only extendable false Skip repeating k-mers false

Set cluster sensitivity to -s 6.000000 Set cluster iterations to 3 linclust artificial2/DB_artificial tmp//7897776346521331899/clu_redundancy tmp//7897776346521331899/linclust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --alph-size nucl:5,aa:13 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 -k 0 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 0 --force-reuse 0

kmermatcher artificial2/DB_artificial tmp//7897776346521331899/linclust/17269245559208916342/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 16 --compressed 0 -v 3

Database size: 136 type: Aminoacid Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)

Generate k-mers list for 1 split [=================================================================] 100.00% 136 0s 41ms Sort kmer 0h 0m 0s 0ms Sort by rep. sequence 0h 0m 0s 0ms Time for fill: 0h 0m 0s 0ms Time for merging to pref: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 149ms rescorediagonal artificial2/DB_artificial artificial2/DB_artificial tmp//7897776346521331899/linclust/17269245559208916342/pref tmp//7897776346521331899/linclust/17269245559208916342/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 136 0s 18ms Time for merging to pref_rescore1: 0h 0m 0s 6ms Time for processing: 0h 0m 0s 48ms clust artificial2/DB_artificial tmp//7897776346521331899/linclust/17269245559208916342/pref_rescore1 tmp//7897776346521331899/linclust/17269245559208916342/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 136 0s 1ms Sort entries Find missing connections Found 9 new connections. Reconstruct initial order [=================================================================] 100.00% 136 0s 2ms Add missing connections [=================================================================] 100.00% 136 0s 0ms

Time for read in: 0h 0m 0s 30ms Total time: 0h 0m 0s 33ms

Size of the sequence database: 136 Size of the alignment database: 136 Number of clusters: 127

Writing results 0h 0m 0s 0ms Time for merging to pre_clust: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 36ms createsubdb tmp//7897776346521331899/linclust/17269245559208916342/order_redundancy artificial2/DB_artificial tmp//7897776346521331899/linclust/17269245559208916342/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms createsubdb tmp//7897776346521331899/linclust/17269245559208916342/order_redundancy tmp//7897776346521331899/linclust/17269245559208916342/pref tmp//7897776346521331899/linclust/17269245559208916342/pref_filter1 -v 3 --subdb-mode 1

Time for merging to pref_filter1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms filterdb tmp//7897776346521331899/linclust/17269245559208916342/pref_filter1 tmp//7897776346521331899/linclust/17269245559208916342/pref_filter2 --filter-file tmp//7897776346521331899/linclust/17269245559208916342/order_redundancy --threads 16 --compressed 0 -v 3

Filtering using file(s) [=================================================================] 100.00% 127 0s 8ms Time for merging to pref_filter2: 0h 0m 0s 3ms Time for processing: 0h 0m 0s 44ms rescorediagonal tmp//7897776346521331899/linclust/17269245559208916342/input_step_redundancy tmp//7897776346521331899/linclust/17269245559208916342/input_step_redundancy tmp//7897776346521331899/linclust/17269245559208916342/pref_filter2 tmp//7897776346521331899/linclust/17269245559208916342/pref_rescore2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 16 --compressed 0 -v 3

Can not find any score per column for coverage 0.800000 and sequence identity 0.000000. No hit will be filtered. [=================================================================] 100.00% 127 0s 5ms Time for merging to pref_rescore2: 0h 0m 0s 2ms=====> ] 78.57% 100 eta 0s Time for processing: 0h 0m 0s 27ms align tmp//7897776346521331899/linclust/17269245559208916342/input_step_redundancy tmp//7897776346521331899/linclust/17269245559208916342/input_step_redundancy tmp//7897776346521331899/linclust/17269245559208916342/pref_rescore2 tmp//7897776346521331899/linclust/17269245559208916342/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 127 type: Aminoacid Target database size: 127 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 127 0s 17ms Time for merging to aln: 0h 0m 0s 1ms 130 alignments calculated 130 sequence pairs passed the thresholds (1.000000 of overall calculated) 1.023622 hits per query sequence Time for processing: 0h 0m 0s 57ms clust tmp//7897776346521331899/linclust/17269245559208916342/input_step_redundancy tmp//7897776346521331899/linclust/17269245559208916342/aln tmp//7897776346521331899/linclust/17269245559208916342/clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 127 0s 3ms Sort entries Find missing connections Found 3 new connections. Reconstruct initial order [=================================================================] 100.00% 127 0s 10ms Add missing connections [=================================================================] 100.00% 127 0s 0ms

Time for read in: 0h 0m 0s 66ms Total time: 0h 0m 0s 82ms

Size of the sequence database: 127 Size of the alignment database: 127 Number of clusters: 124

Writing results 0h 0m 0s 0ms Time for merging to clust: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 84ms mergeclusters artificial2/DB_artificial tmp//7897776346521331899/clu_redundancy tmp//7897776346521331899/linclust/17269245559208916342/pre_clust tmp//7897776346521331899/linclust/17269245559208916342/clust --threads 16 --compressed 0 -v 3

Clustering step 1 [=================================================================] 100.00% 127 0s 6ms Clustering step 2 [=================================================================] 100.00% 124 0s 23ms Write merged clustering [=================================================================] 100.00% 136 0s 37ms Time for merging to clu_redundancy: 0h 0m 0s 3ms Time for processing: 0h 0m 0s 54ms createsubdb tmp//7897776346521331899/clu_redundancy artificial2/DB_artificial tmp//7897776346521331899/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms prefilter tmp//7897776346521331899/input_step_redundancy tmp//7897776346521331899/input_step_redundancy tmp//7897776346521331899/pref_step0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 1 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 0 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 0 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 124 type: Aminoacid Estimated memory consumption: 977M Target database size: 124 type: Aminoacid Index table k-mer threshold: 154 at k-mer size 6 Index table: counting k-mers [=================================================================] 100.00% 124 0s 51ms Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 124 0s 1ms Index statistics Entries: 99 DB size: 488 MB Avg k-mer size: 0.000002 Top 10 k-mers FSMYPQ 6 HFVFHR 4 YQYPRV 3 LAMYPA 1 CHMEKC 1 VQRKKC 1 RGYLLC 1 MVQDRC 1 CEMRRC 1 ERIATC 1 Time for index table init: 0h 0m 1s 150ms Process prefiltering step 1 of 1

k-mer similarity threshold: 154 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 124 Target db start 1 to 124 [=================================================================] 100.00% 124 0s 29ms

0.916862 k-mers per position 1 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 1 sequences passed prefiltering per query sequence 1 median result list length 0 sequences with 0 size result lists Time for merging to pref_step0: 0h 0m 0s 1ms Time for processing: 0h 0m 1s 849ms align tmp//7897776346521331899/input_step_redundancy tmp//7897776346521331899/input_step_redundancy tmp//7897776346521331899/pref_step0 tmp//7897776346521331899/aln_step0 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 124 type: Aminoacid Target database size: 124 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 124 0s 23ms Time for merging to aln_step0: 0h 0m 0s 1ms 124 alignments calculated 124 sequence pairs passed the thresholds (1.000000 of overall calculated) 1.000000 hits per query sequence Time for processing: 0h 0m 0s 72ms clust tmp//7897776346521331899/input_step_redundancy tmp//7897776346521331899/aln_step0 tmp//7897776346521331899/clu_step0 --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 124 0s 3ms Sort entries Find missing connections Found 0 new connections. Reconstruct initial order [=================================================================] 100.00% 124 0s 0ms Add missing connections [=================================================================] 100.00% 124 0s 0ms

Time for read in: 0h 0m 0s 46ms Total time: 0h 0m 0s 72ms

Size of the sequence database: 124 Size of the alignment database: 124 Number of clusters: 124

Writing results 0h 0m 0s 0ms Time for merging to clu_step0: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 75ms createsubdb tmp//7897776346521331899/clu_step0 tmp//7897776346521331899/input_step_redundancy tmp//7897776346521331899/input_step1 -v 3 --subdb-mode 1

Time for merging to input_step1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms prefilter tmp//7897776346521331899/input_step1 tmp//7897776346521331899/input_step1 tmp//7897776346521331899/pref_step1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 3.5 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 124 type: Aminoacid Estimated memory consumption: 977M Target database size: 124 type: Aminoacid Index table k-mer threshold: 131 at k-mer size 6 Index table: counting k-mers [=================================================================] 100.00% 124 0s 30ms Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 124 0s 5ms Index statistics Entries: 116 DB size: 488 MB Avg k-mer size: 0.000002 Top 10 k-mers FSMYPQ 6 HFVFHR 4 YQYPRV 3 LAMYPA 1 ARPIVA 1 CHMEKC 1 VQRKKC 1 RGYLLC 1 MVQDRC 1 CEMRRC 1 Time for index table init: 0h 0m 0s 908ms Process prefiltering step 1 of 1

k-mer similarity threshold: 131 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 124 Target db start 1 to 124 [=================================================================] 100.00% 124 0s 24ms

15.668402 k-mers per position 1 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 1 sequences passed prefiltering per query sequence 1 median result list length 0 sequences with 0 size result lists Time for merging to pref_step1: 0h 0m 0s 4ms Time for processing: 0h 0m 1s 591ms align tmp//7897776346521331899/input_step1 tmp//7897776346521331899/input_step1 tmp//7897776346521331899/pref_step1 tmp//7897776346521331899/aln_step1 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 124 type: Aminoacid Target database size: 124 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 124 0s 26ms Time for merging to aln_step1: 0h 0m 0s 3ms=================> ] 91.87% 114 eta 0s 124 alignments calculated 124 sequence pairs passed the thresholds (1.000000 of overall calculated) 1.000000 hits per query sequence Time for processing: 0h 0m 0s 62ms clust tmp//7897776346521331899/input_step1 tmp//7897776346521331899/aln_step1 tmp//7897776346521331899/clu_step1 --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 124 0s 5ms Sort entries Find missing connections Found 0 new connections. Reconstruct initial order [=================================================================] 100.00% 124 0s 11ms Add missing connections [=================================================================] 100.00% 124 0s 0ms

Time for read in: 0h 0m 0s 87ms Total time: 0h 0m 0s 107ms

Size of the sequence database: 124 Size of the alignment database: 124 Number of clusters: 124

Writing results 0h 0m 0s 0ms Time for merging to clu_step1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 110ms createsubdb tmp//7897776346521331899/clu_step1 tmp//7897776346521331899/input_step1 tmp//7897776346521331899/input_step2 -v 3 --subdb-mode 1

Time for merging to input_step2: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms prefilter tmp//7897776346521331899/input_step2 tmp//7897776346521331899/input_step2 tmp//7897776346521331899/pref_step2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 6 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 124 type: Aminoacid Estimated memory consumption: 977M Target database size: 124 type: Aminoacid Index table k-mer threshold: 109 at k-mer size 6 Index table: counting k-mers [=================================================================] 100.00% 124 0s 19ms Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 124 0s 3ms Index statistics Entries: 116 DB size: 488 MB Avg k-mer size: 0.000002 Top 10 k-mers FSMYPQ 6 HFVFHR 4 YQYPRV 3 LAMYPA 1 ARPIVA 1 CHMEKC 1 VQRKKC 1 RGYLLC 1 MVQDRC 1 CEMRRC 1 Time for index table init: 0h 0m 0s 975ms Process prefiltering step 1 of 1

k-mer similarity threshold: 109 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 124 Target db start 1 to 124 [=================================================================] 100.00% 124 0s 35ms

125.182478 k-mers per position 1 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 1 sequences passed prefiltering per query sequence 1 median result list length 0 sequences with 0 size result lists Time for merging to pref_step2: 0h 0m 0s 3ms Time for processing: 0h 0m 1s 657ms align tmp//7897776346521331899/input_step2 tmp//7897776346521331899/input_step2 tmp//7897776346521331899/pref_step2 tmp//7897776346521331899/aln_step2 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 124 type: Aminoacid Target database size: 124 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 124 0s 37ms Time for merging to aln_step2: 0h 0m 0s 8ms 124 alignments calculated 124 sequence pairs passed the thresholds (1.000000 of overall calculated) 1.000000 hits per query sequence Time for processing: 0h 0m 0s 80ms clust tmp//7897776346521331899/input_step2 tmp//7897776346521331899/aln_step2 tmp//7897776346521331899/clu_step2 --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 124 0s 13ms Sort entries Find missing connections Found 0 new connections. Reconstruct initial order [=================================================================] 100.00% 124 0s 8ms Add missing connections [=================================================================] 100.00% 124 0s 0ms

Time for read in: 0h 0m 0s 95ms Total time: 0h 0m 0s 124ms

Size of the sequence database: 124 Size of the alignment database: 124 Number of clusters: 124

Writing results 0h 0m 0s 0ms Time for merging to clu_step2: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 127ms mergeclusters artificial2/DB_artificial tmp//7897776346521331899/clu tmp//7897776346521331899/clu_redundancy tmp//7897776346521331899/clu_step0 tmp//7897776346521331899/clu_step1 tmp//7897776346521331899/clu_step2

Clustering step 1 [=================================================================] 100.00% 124 0s 5ms Clustering step 2 [=================================================================] 100.00% 124 0s 27ms Clustering step 3 [=================================================================] 100.00% 124 0s 45ms Clustering step 4 [=================================================================] 100.00% 124 0s 64ms Write merged clustering [=================================================================] 100.00% 136 0s 79ms Time for merging to clu: 0h 0m 0s 7ms Time for processing: 0h 0m 0s 99ms align artificial2/DB_artificial artificial2/DB_artificial tmp//7897776346521331899/clu tmp//7897776346521331899/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 136 type: Aminoacid Target database size: 136 type: Aminoacid Calculation of alignments [=================================================================] 100.00% 124 0s 17ms Time for merging to aln: 0h 0m 0s 2ms 136 alignments calculated 134 sequence pairs passed the thresholds (0.985294 of overall calculated) 1.080645 hits per query sequence Time for processing: 0h 0m 0s 43ms subtractdbs tmp//7897776346521331899/clu tmp//7897776346521331899/aln tmp//7897776346521331899/clu_not_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

subtractdbs tmp//7897776346521331899/clu tmp//7897776346521331899/aln tmp//7897776346521331899/clu_not_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

Remove tmp//7897776346521331899/aln ids from tmp//7897776346521331899/clu [=================================================================] 100.00% 124 0s 13ms Time for merging to clu_not_accepted: 0h 0m 0s 4ms Time for processing: 0h 0m 0s 26ms swapdb tmp//7897776346521331899/clu_not_accepted tmp//7897776346521331899/clu_not_accepted_swap --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 124 0s 8ms Computing offsets. [=================================================================] 100.00% 124 0s 5ms

Reading results. [=================================================================] 100.00% 124 0s 7ms

Output database: tmp//7897776346521331899/clu_not_accepted_swap [=================================================================] 100.00% 133 0s 2ms

Time for merging to clu_not_accepted_swap: 0h 0m 0s 1ms Time for processing: 0h 0m 0s 70ms subtractdbs tmp//7897776346521331899/clu tmp//7897776346521331899/clu_not_accepted tmp//7897776346521331899/clu_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

subtractdbs tmp//7897776346521331899/clu tmp//7897776346521331899/clu_not_accepted tmp//7897776346521331899/clu_accepted --e-profile 100000000 -e 100000000 --threads 16 --compressed 0 -v 3

Remove tmp//7897776346521331899/clu_not_accepted ids from tmp//7897776346521331899/clu [=================================================================] 100.00% 124 0s 19ms Time for merging to clu_accepted: 0h 0m 0s 6ms Time for processing: 0h 0m 0s 37ms createsubdb tmp//7897776346521331899/clu_not_accepted_swap artificial2/DB_artificial tmp//7897776346521331899/seq_wrong_assigned -v 3

Time for merging to seq_wrong_assigned: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms createsubdb tmp//7897776346521331899/clu artificial2/DB_artificial tmp//7897776346521331899/seq_seeds -v 3

Time for merging to seq_seeds: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 0ms prefilter tmp//7897776346521331899/seq_wrong_assigned tmp//7897776346521331899/seq_seeds.merged tmp//7897776346521331899/seq_wrong_assigned_pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 6 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 16 --compressed 0 -v 3

Query database size: 2 type: Aminoacid Estimated memory consumption: 977M Target database size: 126 type: Aminoacid Index table k-mer threshold: 109 at k-mer size 6 Index table: counting k-mers [=================================================================] 100.00% 126 0s 29ms Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 126 0s 1ms Index statistics Entries: 116 DB size: 488 MB Avg k-mer size: 0.000002 Top 10 k-mers FSMYPQ 6 HFVFHR 4 YQYPRV 3 LAMYPA 1 ARPIVA 1 CHMEKC 1 VQRKKC 1 RGYLLC 1 MVQDRC 1 CEMRRC 1 Time for index table init: 0h 0m 0s 981ms Process prefiltering step 1 of 1

k-mer similarity threshold: 109 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 2 Target db start 1 to 126 [=================================================================] 100.00% 2 0s 4ms

0.000000 k-mers per position 0 DB matches per sequence 0 overflows 0 queries produce too many hits (truncated result) 0 sequences passed prefiltering per query sequence 0 median result list length 2 sequences with 0 size result lists Time for merging to seq_wrong_assigned_pref: 0h 0m 0s 0ms Time for processing: 0h 0m 1s 582ms swapdb tmp//7897776346521331899/seq_wrong_assigned_pref tmp//7897776346521331899/seq_wrong_assigned_pref_swaped --threads 16 --compressed 0 -v 3

[=================================================================] 100.00% 2 0s 2ms Computing offsets. [=================================================================] 100.00% 2 0s 4ms

Reading results. [=================================================================] 100.00% 2 0s 3ms

Output database: tmp//7897776346521331899/seq_wrong_assigned_pref_swaped [=================================================================] 100.00% 1 eta -

Time for merging to seq_wrong_assigned_pref_swaped: 0h 0m 0s 1ms Time for processing: 0h 0m 0s 110ms 47 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 16 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 126 type: Aminoacid Target database size: 2 type: Aminoacid Calculation of alignments Time for merging to seq_wrong_assigned_pref_swaped_aln: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 7ms filterdb tmp//7897776346521331899/seq_wrong_assigned_pref_swaped_aln tmp//7897776346521331899/seq_wrong_assigned_pref_swaped_aln_ocol --trim-to-one-column --threads 16 --compressed 0 -v 3

Filtering using regular expression mergedbs tmp//7897776346521331899/seq_seeds.merged tmp//7897776346521331899/clu_accepted_plus_wrong tmp//7897776346521331899/clu_accepted tmp//7897776346521331899/seq_wrong_assigned_pref_swaped_aln_ocol --merge-stop-empty 0 --compressed 0 -v 3

Merging the results to tmp//7897776346521331899/clu_accepted_plus_wrong [=================================================================] 100.00% 126 0s 2ms Time for merging to clu_accepted_plus_wrong: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 4ms tsv2db tmp//7897776346521331899/missing.single.seqs tmp//7897776346521331899/missing.single.seqs.db --output-dbtype 6 --compressed 0 -v 3

Output database type: Clustering Time for merging to missing.single.seqs.db: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms mergedbs artificial2/DB_artificial tmp//7897776346521331899/clu_accepted_plus_wrong_plus_single tmp//7897776346521331899/clu_accepted_plus_wrong tmp//7897776346521331899/missing.single.seqs.db --merge-stop-empty 0 --compressed 0 -v 3

Merging the results to tmp//7897776346521331899/clu_accepted_plus_wrong_plus_single [=================================================================] 100.00% 136 0s 0ms Time for merging to clu_accepted_plus_wrong_plus_single: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 2ms clust artificial2/DB_artificial tmp//7897776346521331899/clu_accepted_plus_wrong_plus_single db_clu --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 16 --compressed 0 -v 3

Clustering mode: Set Cover [=================================================================] 100.00% 136 0s 3ms Sort entries Find missing connections Found 10 new connections. Reconstruct initial order Alignment format is not supported! ] 0.00% 1 eta - Alignment format is not supported! Error: Clustering step 2 died

s-devos commented 3 years ago

Fortunately, the --spaced-kmer-pattern option does not give errors anymore. Nonetheless, using this option with -k 5 and --mask 0 results in the same errors.

milot-mirdita commented 3 years ago

So I traced back where this message comes from. Seems like the current cluster reassignment procedure will only work with greedy incremental clustering.

Not sure if it can be made to work with other cluster modes easily. I guess it should automatically choose greedy if --cluster-reassign is used.

s-devos commented 3 years ago

I am currently doing a benchmark on clustering algorithms, where I find that Greedy Set Cover would also hugely benefit from this option. This is in line with the guidelines, describing --cluster-reassign as the one solution for the cascaded clustering caveat; without this option, there is no certainty that clustering criteria remain fulfilled over multiple cascade steps due to changing representatives.
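To make the caveat concrete, here is a minimal toy sketch of what a reassignment pass does, loosely following the steps visible in the log above: check members against their representative, set aside the ones that fail, try them against the other representatives, and keep the rest as singletons. It is not MMseqs2 code. The `identity` score, the `reassign` function, and the sequence names are all hypothetical, and a real run uses alignments with coverage and E-value criteria rather than this position-by-position comparison.

```python
from typing import Callable, Dict, List


def identity(a: str, b: str) -> float:
    """Toy similarity (assumption): matching positions over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longest


def reassign(
    clusters: Dict[str, List[str]],
    seqs: Dict[str, str],
    threshold: float,
    score: Callable[[str, str], float] = identity,
) -> Dict[str, List[str]]:
    """Re-check every member against its representative and move the failures."""
    # Every representative keeps itself as the first member.
    accepted: Dict[str, List[str]] = {rep: [rep] for rep in clusters}
    rejected: List[str] = []

    # Step 1: keep members that still satisfy the criterion.
    for rep, members in clusters.items():
        for member in members:
            if member == rep:
                continue
            if score(seqs[rep], seqs[member]) >= threshold:
                accepted[rep].append(member)
            else:
                rejected.append(member)

    # Step 2: try the rejected members against all original representatives.
    for member in rejected:
        best = max(clusters, key=lambda rep: score(seqs[rep], seqs[member]))
        if score(seqs[best], seqs[member]) >= threshold:
            accepted[best].append(member)
        else:
            # Step 3: nothing accepts it, so it becomes its own cluster.
            accepted[member] = [member]
    return accepted


# Hypothetical input: "b" and "c" ended up under "r1" after cascading,
# although neither satisfies the threshold against "r1".
seqs = {
    "r1": "AAAAAAAAAA",
    "r2": "CCCCCCCCCC",
    "a": "AAAAAAAAAC",
    "b": "CCCCCCCCCA",
    "c": "GGGGGGGGGG",
}
clusters = {"r1": ["r1", "a", "b", "c"], "r2": ["r2"]}
result = reassign(clusters, seqs, threshold=0.8)
```

In this toy input, `b` moves to `r2` and `c` becomes a singleton, so every remaining member satisfies the threshold against its own representative.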

milot-mirdita commented 3 years ago

a19f5a526012b849a723935acf56d10f39d24611 should solve the issue with Alignment format is not supported!

s-devos commented 3 years ago

Many thanks; cluster reassign seems to be working well with all clustering modes now!

I'm not sure if the k-mer problem is solved yet. On my proprietary dataset, I find that setting a small k-mer size with -k 5 prevents anything shorter than 13 AA from being clustered. When I do not set k manually, some of these sequences are clustered. However, I cannot reproduce the problem with an artificial set yet. I will get back to it in a new issue.

milot-mirdita commented 3 years ago

Are you also providing a shorter spaced pattern? The default spaced pattern for 5-mers is 12 characters long. The shortest sequence that can be found is therefore 13 characters long.
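As a rough illustration of why the pattern length sets a floor on sequence length, the sketch below extracts spaced k-mers from a sequence. The pattern `110010100001` is a made-up 12-character pattern with five informative positions, not the actual MMseqs2 default, and `spaced_kmers` is a hypothetical helper. The sketch only shows that a sequence shorter than the pattern cannot yield any k-mer. It does not reproduce any additional boundary condition MMseqs2 may apply, which may be why the cutoff observed above is 13 rather than 12.

```python
from typing import List

# Hypothetical pattern: 12 characters long, five '1' positions (a spaced 5-mer).
PATTERN = "110010100001"


def spaced_kmers(seq: str, pattern: str = PATTERN) -> List[str]:
    """Return the spaced k-mers of seq, reading only the '1' positions."""
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a non-empty string of 0s and 1s")
    span = len(pattern)
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    # range() is empty when the sequence is shorter than the pattern span.
    return [
        "".join(seq[start + i] for i in keep)
        for start in range(len(seq) - span + 1)
    ]


too_short = spaced_kmers("ACDEFGHIKLM")     # 11 residues: no window fits
just_fits = spaced_kmers("ACDEFGHIKLMN")    # 12 residues: one window
one_longer = spaced_kmers("ACDEFGHIKLMNP")  # 13 residues: two windows
```

A shorter pattern passed via --spaced-kmer-pattern would lower this floor accordingly.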

s-devos commented 3 years ago

Ah, that must be it! Thank you so much for your devotion! Both issues can be closed, then.

milot-mirdita commented 3 years ago

Great, let us know if there are any other issues :)

soedinglab / MMseqs2

Cascaded clustering dies with --cluster-reassign option #400
