there must be an error: xx deleted from xx that now is empty, but not assigned to a cluster

Xiaoyu2425 commented 1 year ago

Expected Behavior

Hi! I'm clustering ~60000 nucleotide sequences, each 250 bp to 500 bp long.
Here is the command I use:

mmseqs cluster shrimpDB shrimp.clu97 tmp97 --cluster-mode 0 -s 7.5 --min-seq-id 0.97 --min-aln-len 200

Current Behavior

I got several lines of error messages:

there must be an error: 50016 deleted from 30868 that now is empty, but not assigned to a cluster
there must be an error: 33062 deleted from 15885 that now is empty, but not assigned to a cluster
there must be an error: 40430 deleted from 27586 that now is empty, but not assigned to a cluster
there must be an error: 13350 deleted from 28482 that now is empty, but not assigned to a cluster
there must be an error: 11573 deleted from 27334 that now is empty, but not assigned to a cluster
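For anyone triaging a long log like the one in this report, a small script can pull the warning lines out. This is a minimal sketch, assuming the warning text matches exactly what is quoted above. The function name `extract_warnings` and the labels `deleted_id` and `source_id` are my own, not MMseqs2 terminology, and what the two numbers mean internally is not established in this thread.

```python
import re

# Pattern for the warning text quoted in this issue. The two captured
# numbers are labelled "deleted_id" and "source_id" here for convenience
# only (hypothetical names, not taken from MMseqs2).
WARNING_RE = re.compile(
    r"there must be an error: (\d+) deleted from (\d+) "
    r"that now is empty, but not assigned to a cluster"
)


def extract_warnings(log_text):
    """Return a list of (deleted_id, source_id) pairs found in log_text.

    Works on flattened logs too, because the pattern is searched
    anywhere in the text rather than anchored to line starts.
    """
    return [(int(a), int(b)) for a, b in WARNING_RE.findall(log_text)]
```

Reading a saved log with `open(path).read()` and passing the text to `extract_warnings` gives one pair per warning, which makes it easy to count them or compare runs.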

MMseqs Output (for bugs)

(base) [xshan@node422 blast]$ mmseqs cluster shrimpDB shrimp.clu97 tmp97 --cluster-mode 0 -s 7.5 --min-seq-id 0.97 --min-aln-len 200
Create directory tmp97
cluster shrimpDB shrimp.clu97 tmp97 --cluster-mode 0 -s 7.5 --min-seq-id 0.97 --min-aln-len 200

MMseqs Version: bb0a1b3569b9fe115f3bf63e5ba1da234748de23 Substitution matrix aa:blosum62.out,nucl:nucleotide.out Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out Sensitivity 7.5 k-mer length 15 Target search mode 0 k-score seq:2147483647,prof:2147483647 Alphabet size aa:21,nucl:5 Max sequence length 10000 Max results per query 20 Split database 0 Split mode 2 Split memory limit 0 Coverage threshold 0.8 Coverage mode 0 Compositional bias 1 Compositional bias 1 Diagonal scoring false Exact k-mer matching 1 Mask residues 1 Mask residues probability 0.9 Mask lower case residues 0 Minimum diagonal score 15 Selected taxa Include identical seq. id. false Spaced k-mers 1 Preload mode 0 Pseudo count a substitution:1.100,context:1.400 Pseudo count b substitution:4.100,context:5.800 Spaced k-mer pattern Local temporary path Threads 20 Compressed 0 Verbosity 3 Add backtrace false Alignment mode 3 Alignment mode 0 Allow wrapped scoring false E-value threshold 0.001 Seq. id. threshold 0.97 Min alignment length 200 Seq. id. mode 0 Alternative alignments 0 Max reject 2147483647 Max accept 2147483647 Score bias 0 Realign hits false Realign score bias -0.2 Realign max seqs 2147483647 Correlation score weight 0 Gap open cost aa:11,nucl:5 Gap extension cost aa:1,nucl:2 Zdrop 40 Rescore mode 0 Remove hits by seq. id. and coverage false Sort results 0 Cluster mode 0 Max connected component depth 1000 Similarity type 2 Weight file name Cluster Weight threshold 0.9 Single step clustering false Cascaded clustering steps 3 Cluster reassign false Remove temporary files false Force restart with latest tmp false MPI runner k-mers per sequence 21 Scale k-mers per sequence aa:0.000,nucl:0.200 Adjust k-mer length false Shift hash 67 Include only extendable false Skip repeating k-mers false

Set cluster iterations to 3 linclust shrimpDB tmp97/17949317426677965256/clu_redundancy tmp97/17949317426677965256/linclust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 20 --compressed 0 -v 3 --cluster-weight-threshold 0.9 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.97 --min-aln-len 200 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 10000 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --alph-size aa:21,nucl:5 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 -k 0 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 0 --force-reuse 0

kmermatcher shrimpDB tmp97/17949317426677965256/linclust/6279588666755106708/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:21,nucl:5 --min-seq-id 0.97 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 10000 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 20 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Database size: 51992 type: Nucleotide

Generate k-mers list for 1 split [=================================================================] 100.00% 51.99K 0s 346ms

Adjusted k-mer length 17 Sort kmer 0h 0m 0s 62ms Sort by rep. sequence 0h 0m 0s 19ms Time for fill: 0h 0m 0s 27ms Time for merging to pref: 0h 0m 0s 3ms Time for processing: 0h 0m 0s 567ms rescorediagonal shrimpDB shrimpDB tmp97/17949317426677965256/linclust/6279588666755106708/pref tmp97/17949317426677965256/linclust/6279588666755106708/pref_rescore1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.97 --min-aln-len 200 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 20 --compressed 0 -v 3

[=================================================================] 100.00% 51.99K 0s 101ms Time for merging to pref_rescore1: 0h 0m 0s 185ms================>] 99.62% 51.80K eta 0s Time for processing: 0h 5m 0s 451ms clust shrimpDB tmp97/17949317426677965256/linclust/6279588666755106708/pref_rescore1 tmp97/17949317426677965256/linclust/6279588666755106708/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 20 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Set Cover [=================================================================] 100.00% 51.99K 0s 51ms Sort entries Find missing connections Found 39230 new connections. Reconstruct initial order [=================================================================] 100.00% 51.99K 0s 55ms Add missing connections [=================================================================] 100.00% 51.99K 0s 6ms

Time for read in: 0h 0m 0s 174ms Total time: 0h 0m 0s 197ms

Size of the sequence database: 51992
Size of the alignment database: 51992
Number of clusters: 25629

Writing results 0h 0m 0s 10ms Time for merging to pre_clust: 0h 0m 0s 2ms Time for processing: 0h 0m 0s 256ms createsubdb tmp97/17949317426677965256/linclust/6279588666755106708/order_redundancy shrimpDB tmp97/17949317426677965256/linclust/6279588666755106708/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 3ms Time for processing: 0h 0m 0s 33ms createsubdb tmp97/17949317426677965256/linclust/6279588666755106708/order_redundancy tmp97/17949317426677965256/linclust/6279588666755106708/pref tmp97/17949317426677965256/linclust/6279588666755106708/pref_filter1 -v 3 --subdb-mode 1

Time for merging to pref_filter1: 0h 0m 0s 3ms Time for processing: 0h 1m 40s 43ms filterdb tmp97/17949317426677965256/linclust/6279588666755106708/pref_filter1 tmp97/17949317426677965256/linclust/6279588666755106708/pref_filter2 --filter-file tmp97/17949317426677965256/linclust/6279588666755106708/order_redundancy --threads 20 --compressed 0 -v 3

Filtering using file(s) [=================================================================] 100.00% 25.63K 0s 61ms Time for merging to pref_filter2: 0h 0m 0s 138ms Time for processing: 0h 1m 40s 346ms align tmp97/17949317426677965256/linclust/6279588666755106708/input_step_redundancy tmp97/17949317426677965256/linclust/6279588666755106708/input_step_redundancy tmp97/17949317426677965256/linclust/6279588666755106708/pref_filter2 tmp97/17949317426677965256/linclust/6279588666755106708/aln --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.97 --min-aln-len 200 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 10000 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 20 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 25629 type: Nucleotide Target database size: 25629 type: Nucleotide Calculation of alignments [=================================================================] 100.00% 25.63K 1s 731ms Time for merging to aln: 0h 0m 0s 152ms 209782 alignments calculated 32736 sequence pairs passed the thresholds (0.156048 of overall calculated) 1.277303 hits per query sequence Time for processing: 0h 3m 22s 221ms clust tmp97/17949317426677965256/linclust/6279588666755106708/input_step_redundancy tmp97/17949317426677965256/linclust/6279588666755106708/aln tmp97/17949317426677965256/linclust/6279588666755106708/clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 20 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Set Cover [=================================================================] 100.00% 25.63K 0s 25ms Sort entries Find missing connections Found 7107 new connections. Reconstruct initial order [=================================================================] 100.00% 25.63K 0s 32ms Add missing connections [=================================================================] 100.00% 25.63K 0s 2ms

Time for read in: 0h 0m 0s 105ms Total time: 0h 0m 0s 115ms

Size of the sequence database: 25629
Size of the alignment database: 25629
Number of clusters: 19825

Writing results 0h 0m 0s 7ms Time for merging to clust: 0h 0m 0s 2ms Time for processing: 0h 0m 0s 177ms mergeclusters shrimpDB tmp97/17949317426677965256/clu_redundancy tmp97/17949317426677965256/linclust/6279588666755106708/pre_clust tmp97/17949317426677965256/linclust/6279588666755106708/clust --threads 20 --compressed 0 -v 3

Clustering step 1 [=================================================================] 100.00% 25.63K 0s 41ms Clustering step 2 [=================================================================] 100.00% 19.83K 0s 74ms Write merged clustering [=================================================================] 100.00% 51.99K 3m 20s 159ms Time for merging to clu_redundancy: 0h 0m 0s 136ms Time for processing: 0h 3m 20s 346ms createsubdb tmp97/17949317426677965256/clu_redundancy shrimpDB tmp97/17949317426677965256/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 3ms Time for processing: 0h 1m 40s 35ms extractframes tmp97/17949317426677965256/input_step_redundancy tmp97/17949317426677965256/query_seqs --forward-frames 1 --reverse-frames 1 --create-lookup 0 --threads 20 --compressed 0 -v 3

[=================================================================] 100.00% 19.83K 0s 62ms Time for merging to query_seqs_h: 0h 0m 0s 261ms Time for merging to query_seqs: 0h 1m 40s 122ms Time for processing: 0h 8m 20s 689ms prefilter tmp97/17949317426677965256/query_seqs tmp97/17949317426677965256/input_step_redundancy tmp97/17949317426677965256/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 7.5 -k 15 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 10000 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 0 --comp-bias-corr-scale 1 --diag-score 0 --exact-kmer-matching 1 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 60 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 20 --compressed 0 -v 3

Query database size: 39650 type: Nucleotide Estimated memory consumption: 8G Target database size: 19825 type: Nucleotide Index table k-mer threshold: 0 at k-mer size 15 Index table: counting k-mers [=================================================================] 100.00% 19.83K 0s 193ms Index table: Masked residues: 3096 Index table: fill [=================================================================] 100.00% 19.83K 0s 133ms Index statistics Entries: 5861616 DB size: 8225 MB Avg k-mer size: 0.005459 Top 10 k-mers GTACGTGAATTGAAT 10331 AAACTGGGAGAAGAT 9920 AAGGGGGGGCCGGTT 9235 CGAACGTGGGAACGG 8944 GGGGAAAGGCTGGGG 7283 TCGATTACGGTAACG 6945 GTGCGCAGCGTATCG 6389 CCCGGCTCACGAATG 5538 ACTGCGTAAGGGTGG 5044 GACCGAGGGCACGGG 4773 Time for index table init: 0h 0m 8s 613ms Process prefiltering step 1 of 1

k-mer similarity threshold: 0 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 39650 Target db start 1 to 19825 [=================================================================] 100.00% 39.65K 3s 800ms

0.928116 k-mers per position 129909 DB matches per sequence 0 overflows 85 sequences passed prefiltering per query sequence 1 median result list length 19825 sequences with 0 size result lists Time for merging to pref: 0h 0m 0s 137ms Time for processing: 0h 3m 32s 791ms rescorediagonal tmp97/17949317426677965256/query_seqs tmp97/17949317426677965256/input_step_redundancy tmp97/17949317426677965256/pref tmp97/17949317426677965256/aln_ungapped --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 2 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.97 --min-aln-len 200 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 20 --compressed 0 -v 3

[=================================================================] 100.00% 39.65K 0s 376ms Time for merging to aln_ungapped: 0h 0m 0s 129ms Time for processing: 0h 6m 40s 663ms subtractdbs tmp97/17949317426677965256/pref tmp97/17949317426677965256/aln_ungapped tmp97/17949317426677965256/pref_subtract --threads 20 --compressed 0 -v 3

Remove tmp97/17949317426677965256/aln_ungapped ids from tmp97/17949317426677965256/pref [=================================================================] 100.00% 39.65K 0s 144ms Time for merging to pref_subtract: 0h 0m 0s 168ms Time for processing: 0h 0m 0s 503ms align tmp97/17949317426677965256/query_seqs tmp97/17949317426677965256/input_step_redundancy tmp97/17949317426677965256/pref_subtract tmp97/17949317426677965256/aln_gapped --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.97 --min-aln-len 200 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 10000 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 20 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 39650 type: Nucleotide Target database size: 19825 type: Nucleotide Calculation of alignments [=================================================================] 100.00% 39.65K 11s 920ms Time for merging to aln_gapped: 0h 1m 40s 131ms 2105218 alignments calculated 1620 sequence pairs passed the thresholds (0.000770 of overall calculated) 0.040858 hits per query sequence Time for processing: 0h 5m 12s 297ms concatdbs tmp97/17949317426677965256/aln_ungapped tmp97/17949317426677965256/aln_gapped tmp97/17949317426677965256/aln --preserve-keys --take-larger-entry --threads 20 --compressed 0 -v 3

[=================================================================] 100.00% 39.65K 0s 67ms [=================================================================] 100.00% 39.65K 0s 98ms Time for merging to aln: 0h 0m 0s 389ms Time for processing: 0h 0m 0s 669ms offsetalignment tmp97/17949317426677965256/input_step_redundancy tmp97/17949317426677965256/query_seqs tmp97/17949317426677965256/input_step_redundancy tmp97/17949317426677965256/input_step_redundancy tmp97/17949317426677965256/aln tmp97/17949317426677965256/aln_off --chain-alignments 0 --merge-query 1 --search-type 3 --threads 20 --compressed 0 --db-load-mode 0 -v 3

Computing ORF lookup Computing contig offsets Computing contig lookup Time for contig lookup: 0h 0m 0s 3ms Writing results to: tmp97/17949317426677965256/aln_off [=================================================================] 100.00% 51.99K 0s 153ms

Time for merging to aln_off: 0h 0m 0s 136ms Time for processing: 0h 3m 20s 468ms clust tmp97/17949317426677965256/input_step_redundancy tmp97/17949317426677965256/aln_off tmp97/17949317426677965256/clu --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 20 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Set Cover [=================================================================] 100.00% 19.83K 0s 53ms Sort entries Find missing connections Found 49253 new connections. Reconstruct initial order [=================================================================] 100.00% 19.83K 0s 60ms Add missing connections [=================================================================] 100.00% 19.83K 0s 115ms

Time for read in: 0h 0m 0s 296ms
there must be an error: 50016 deleted from 30868 that now is empty, but not assigned to a cluster
there must be an error: 33062 deleted from 15885 that now is empty, but not assigned to a cluster
there must be an error: 40430 deleted from 27586 that now is empty, but not assigned to a cluster
there must be an error: 13350 deleted from 28482 that now is empty, but not assigned to a cluster
there must be an error: 11573 deleted from 27334 that now is empty, but not assigned to a cluster
Total time: 0h 0m 0s 320ms

Size of the sequence database: 19825
Size of the alignment database: 19825
Number of clusters: 12502

Writing results 0h 0m 0s 6ms Time for merging to clu: 0h 0m 0s 3ms Time for processing: 0h 0m 0s 356ms mergeclusters shrimpDB shrimp.clu97 tmp97/17949317426677965256/clu_redundancy tmp97/17949317426677965256/clu --threads 20 --compressed 0 -v 3

Clustering step 1 [=================================================================] 100.00% 19.83K 0s 39ms Clustering step 2 [=================================================================] 100.00% 12.50K 0s 67ms Write merged clustering [=================================================================] 100.00% 51.99K 0s 142ms Time for merging to shrimp.clu97: 0h 0m 0s 119ms Time for processing: 0h 0m 0s 338ms
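Since the run finishes and writes a result despite the warnings, one way to see whether the final clustering is affected is to check that every input sequence appears exactly once as a cluster member. The sketch below assumes the result has been exported to a two-column TSV (representative, then member, tab-separated), for example with `mmseqs createtsv`. The function name `check_assignment` and its arguments are hypothetical, and whether the warning actually leads to missing or duplicated members is not confirmed anywhere in this thread.

```python
from collections import Counter


def check_assignment(tsv_lines, all_ids):
    """Compare cluster members in a two-column TSV against expected IDs.

    tsv_lines: iterable of "representative<TAB>member" lines.
    all_ids:   iterable of every sequence ID that should be clustered.
    Returns (missing, duplicated): IDs that never appear as a member,
    and IDs that appear as a member more than once.
    """
    counts = Counter()
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        _rep, member = line.split("\t")[:2]
        counts[member] += 1
    missing = sorted(set(all_ids) - set(counts))
    duplicated = sorted(m for m, n in counts.items() if n > 1)
    return missing, duplicated
```

If both returned lists are empty, every expected ID is assigned to exactly one cluster in the exported file. That does not say why the warning is printed, only that the output is complete in this narrow sense.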

Your Environment

Include as many relevant details about the environment you experienced the bug in.

Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): bb0a1b3569b9fe115f3bf63e5ba1da234748de23
Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): static build with AVX2
For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation: cmake version 2.8.12.2
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): 250GB RAM, 20 CPUs
Operating system and version: CentOS 7

Johnsonzcode commented 10 months ago

Same error. Hope it will be fixed.

valentynbez commented 7 months ago

I get the same error when clustering protein sequences with easy-cluster. The command I used:

mmseqs easy-cluster raw/proteins.faa.gz processed/proteins.id50.c90 tmp \
    -c 0.9 --min-seq-id 0.5 --threads 16 --cluster-reassign

timyerg commented 7 months ago

Hello! Same error with nucleotide sequences.

cschu commented 7 months ago

Hi, I got the same issue with nucleotide sequences (version 15.6f452).

cluster --threads 8 --split-memory-limit 128G --min-seq-id 0.95 -c 0.90 --cov-mode 0

d-jch commented 3 months ago

Hi, I got the same issue with nucleotide sequences (version c498f51053e2f550a4ab4bee534b0ef80033a2b3 and 15.6f452).

mmseqs clust tmp/13448582598387550165/clu_tmp/9227260758217224448/input_step_redundancy tmp/13448582598387550165/clu_tmp/9227260758217224448/aln_off tmp/13448582598387550165/clu_tmp/9227260758217224448/clu --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 20 --compressed 0 -v 3 --cluster-weight-threshold 0.9
