cluster stuck at prefilter stage for multiple days

bresyd commented 4 years ago

Expected Behavior

not sure

Current Behavior

The cluster workflow has been stuck at the prefilter stage for more than 3 days.

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.

MMseqs Output (for bugs)

Here is the log printed on the screen:

/home/btschits/apps/MMseqs2/build/bin/mmseqs cluster ../so245_combined_assembly_mmseqDB so245_combined_assembly_mmseq_clustered cluster_tmp --cluster-mode 2 --alignment-mode 3 --cov-mode 1 -c 0.99 --min-seq-id 0.99 --max-seq-len 10000000 --cluster-reassign 1 --threads 40
Tmp cluster_tmp folder does not exist or is not a directory.
Create dir cluster_tmp
cluster ../so245_combined_assembly_mmseqDB so245_combined_assembly_mmseq_clustered cluster_tmp --cluster-mode 2 --alignment-mode 3 --cov-mode 1 -c 0.99 --min-seq-id 0.99 --max-seq-len 10000000 --cluster-reassign 1 --threads 40

MMseqs Version: 61ca48883b50714be51fc35fc9b77325ffde53fb
Substitution matrix nucl:nucleotide.out,aa:blosum62.out
Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out
Sensitivity 4
k-mer length 0
k-score 2147483647
Alphabet size nucl:5,aa:21
Max sequence length 10000000
Max results per query 20
Split database 0
Split mode 2
Split memory limit 0
Coverage threshold 0.99
Coverage mode 1
Compositional bias 1
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask lower case residues 0
Minimum diagonal score 15
Include identical seq. id. false
Spaced k-mers 1
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Spaced k-mer pattern
Local temporary path
Threads 40
Compressed 0
Verbosity 3
Add backtrace false
Alignment mode 3
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0.99
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Realign hits false
Max reject 2147483647
Max accept 2147483647
Score bias 0
Gap open cost 11
Gap extension cost 1
Zdrop 40
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Cluster mode 2
Max connected component depth 1000
Similarity type 2
Single step clustering false
Cascaded clustering steps 3
Cluster reassign 1
Remove temporary files false
Force restart with latest tmp false
MPI runner
k-mers per sequence 21
Scale k-mers per sequence 0
Adjust k-mer length false
Shift hash 67
Include only extendable false
Skip repeating k-mers false

Set cluster sensitivity to -s 1.000000
Set cluster iterations to 1
linclust ../so245_combined_assembly_mmseqDB cluster_tmp/1127447206531551203/clu_redundancy cluster_tmp/1127447206531551203/linclust --cluster-mode 2 --max-iterations 1000 --similarity-type 2 --threads 40 --compressed 0 -v 3 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.99 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.99 --cov-mode 1 --max-seq-len 10000000 --comp-bias-corr 0 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open 11 --gap-extend 1 --zdrop 40 --alph-size nucl:5,aa:13 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale 0 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 -k 0 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 0 --force-reuse 0

kmermatcher ../so245_combined_assembly_mmseqDB cluster_tmp/1127447206531551203/linclust/8761493678692146066/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.99 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale 0 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 1 -k 0 -c 0.99 --max-seq-len 10000000 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 40 --compressed 0 -v 3

Database size: 121964581 type: Nucleotide

Generate k-mers list for 1 split [=================================================================] 100.00% 121.96M 2m 14s 699ms

Adjusted k-mer length 17
Sort kmer 0h 2m 30s 173ms
Sort by rep. sequence 0h 1m 48s 715ms
Time for fill: 0h 2m 33s 635ms
Time for merging to pref: 0h 1m 8s 632ms
Time for processing: 0h 12m 5s 26ms

rescorediagonal ../so245_combined_assembly_mmseqDB ../so245_combined_assembly_mmseqDB cluster_tmp/1127447206531551203/linclust/8761493678692146066/pref cluster_tmp/1127447206531551203/linclust/8761493678692146066/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.99 -a 0 --cov-mode 1 --min-seq-id 0.99 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 40 --compressed 0 -v 3

[=================================================================] 100.00% 121.96M 5m 3s 965ms
Time for merging to pref_rescore1: 0h 1m 26s 443ms
Time for processing: 0h 7m 19s 298ms

clust ../so245_combined_assembly_mmseqDB cluster_tmp/1127447206531551203/linclust/8761493678692146066/pref_rescore1 cluster_tmp/1127447206531551203/linclust/8761493678692146066/pre_clust --cluster-mode 2 --max-iterations 1000 --similarity-type 2 --threads 40 --compressed 0 -v 3

Clustering mode: Greedy
Total time: 0h 0m 53s 944ms

Size of the sequence database: 121964581
Size of the alignment database: 121964581
Number of clusters: 119253279

Writing results 0h 1m 14s 106ms
Time for merging to pre_clust: 0h 0m 55s 443ms
Time for processing: 0h 4m 11s 817ms

createsubdb cluster_tmp/1127447206531551203/linclust/8761493678692146066/order_redundancy ../so245_combined_assembly_mmseqDB cluster_tmp/1127447206531551203/linclust/8761493678692146066/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 28s 447ms
Time for processing: 0h 1m 21s 26ms

createsubdb cluster_tmp/1127447206531551203/linclust/8761493678692146066/order_redundancy cluster_tmp/1127447206531551203/linclust/8761493678692146066/pref cluster_tmp/1127447206531551203/linclust/8761493678692146066/pref_filter1 -v 3 --subdb-mode 1

Time for merging to pref_filter1: 0h 0m 30s 812ms
Time for processing: 0h 1m 20s 243ms

filterdb cluster_tmp/1127447206531551203/linclust/8761493678692146066/pref_filter1 cluster_tmp/1127447206531551203/linclust/8761493678692146066/pref_filter2 --filter-file cluster_tmp/1127447206531551203/linclust/8761493678692146066/order_redundancy --threads 40 --compressed 0 -v 3

Filtering using file(s) [=================================================================] 100.00% 119.25M 2m 56s 532ms
Time for merging to pref_filter2: 0h 1m 34s 791ms
Time for processing: 0h 5m 27s 349ms

align cluster_tmp/1127447206531551203/linclust/8761493678692146066/input_step_redundancy cluster_tmp/1127447206531551203/linclust/8761493678692146066/input_step_redundancy cluster_tmp/1127447206531551203/linclust/8761493678692146066/pref_filter2 cluster_tmp/1127447206531551203/linclust/8761493678692146066/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 3 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.99 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.99 --cov-mode 1 --max-seq-len 10000000 --comp-bias-corr 0 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open 11 --gap-extend 1 --zdrop 40 --threads 40 --compressed 0 -v 3

Compute score, coverage and sequence identity
Query database size: 119253279 type: Nucleotide
Target database size: 119253279 type: Nucleotide
Calculation of alignments [=================================================================] 100.00% 119.25M 1h 5m 19s 819ms
Time for merging to aln: 0h 1m 24s 501ms

1903287037 alignments calculated.
119437486 sequence pairs passed the thresholds (0.062753 of overall calculated).
1.001545 hits per query sequence.
Time for processing: 1h 7m 38s 628ms

clust cluster_tmp/1127447206531551203/linclust/8761493678692146066/input_step_redundancy cluster_tmp/1127447206531551203/linclust/8761493678692146066/aln cluster_tmp/1127447206531551203/linclust/8761493678692146066/clust --cluster-mode 2 --max-iterations 1000 --similarity-type 2 --threads 40 --compressed 0 -v 3

Clustering mode: Greedy
Total time: 0h 0m 49s 674ms

Size of the sequence database: 119253279
Size of the alignment database: 119253279
Number of clusters: 119070883

Writing results 0h 1m 6s 430ms
Time for merging to clust: 0h 1m 2s 216ms
Time for processing: 0h 3m 54s 259ms

mergeclusters ../so245_combined_assembly_mmseqDB cluster_tmp/1127447206531551203/clu_redundancy cluster_tmp/1127447206531551203/linclust/8761493678692146066/pre_clust cluster_tmp/1127447206531551203/linclust/8761493678692146066/clust --threads 40 --compressed 0 -v 3

Clustering step 1 [=================================================================] 100.00% 119.25M 29s 301ms
Clustering step 2 [=================================================================] 100.00% 119.07M 1m 36s 476ms
Write merged clustering [=================================================================] 100.00% 121.96M 2m 4s 130ms
Time for merging to clu_redundancy: 0h 1m 28s 54ms
Time for processing: 0h 4m 53s 669ms

createsubdb cluster_tmp/1127447206531551203/clu_redundancy ../so245_combined_assembly_mmseqDB cluster_tmp/1127447206531551203/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 26s 718ms
Time for processing: 0h 1m 14s 12ms

prefilter cluster_tmp/1127447206531551203/input_step_redundancy cluster_tmp/1127447206531551203/input_step_redundancy cluster_tmp/1127447206531551203/pref_step0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 1 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 10000000 --max-seqs 20 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.99 --cov-mode 1 --comp-bias-corr 0 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 0 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 40 --compressed 0 -v 3

Query database size: 119070883 type: Nucleotide
Estimated memory consumption: 544G
Target database size: 119070883 type: Nucleotide
Index table k-mer threshold: 0 at k-mer size 7
Index table: counting k-mers [=================================================================] 100.00% 119.07M 11m 59s 618ms
Index table: Masked residues: 1954198777
Index table: fill [=================================================================] 100.00% 119.07M 8m 53s 150ms
Index statistics
Entries: 43365843082
DB size: 248141 MB
Avg k-mer size: 2646841.008423
Top 10 k-mers
AAAAAAA 23063795
AAATTAA 21644433
AATTTAA 21117606
AAAAATT 20537065
AAATTTT 19681970
TTTTTTT 19035614
AATTTTT 18870179
ATTTTTT 17045685
TTTATTT 16633440
ATAAATT 16256141
Time for index table init: 0h 23m 3s 602ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 0
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 119070883
Target db start 1 to 119070883
[> ] 0.00% 1 eta -

Here are the files that were so far created in the tmp directory:

-rwx------ 1 btschits biogeo 11721 Apr 16 18:49 cascaded_clustering.sh
-rw-r--r-- 1 btschits biogeo 4 Apr 16 20:38 clu_redundancy.dbtype
-rw-r--r-- 1 btschits biogeo 2631811454 Apr 16 20:38 clu_redundancy.index
-rw-r--r-- 1 btschits biogeo 28134491 Apr 16 20:37 clu_redundancy.0
-rw-r--r-- 1 btschits biogeo 28195022 Apr 16 20:37 clu_redundancy.1
-rw-r--r-- 1 btschits biogeo 34003434 Apr 16 20:37 clu_redundancy.2
-rw-r--r-- 1 btschits biogeo 27288866 Apr 16 20:37 clu_redundancy.3
-rw-r--r-- 1 btschits biogeo 30156064 Apr 16 20:37 clu_redundancy.4
-rw-r--r-- 1 btschits biogeo 33581078 Apr 16 20:37 clu_redundancy.5
-rw-r--r-- 1 btschits biogeo 31402798 Apr 16 20:37 clu_redundancy.6
-rw-r--r-- 1 btschits biogeo 34117685 Apr 16 20:37 clu_redundancy.7
-rw-r--r-- 1 btschits biogeo 29287509 Apr 16 20:37 clu_redundancy.8
-rw-r--r-- 1 btschits biogeo 31087518 Apr 16 20:37 clu_redundancy.9
-rw-r--r-- 1 btschits biogeo 41727025 Apr 16 20:37 clu_redundancy.10
-rw-r--r-- 1 btschits biogeo 27407122 Apr 16 20:37 clu_redundancy.11
-rw-r--r-- 1 btschits biogeo 40334061 Apr 16 20:37 clu_redundancy.12
-rw-r--r-- 1 btschits biogeo 36353076 Apr 16 20:37 clu_redundancy.13
-rw-r--r-- 1 btschits biogeo 28879522 Apr 16 20:37 clu_redundancy.14
-rw-r--r-- 1 btschits biogeo 30977678 Apr 16 20:37 clu_redundancy.15
-rw-r--r-- 1 btschits biogeo 35186268 Apr 16 20:37 clu_redundancy.16
-rw-r--r-- 1 btschits biogeo 25359660 Apr 16 20:37 clu_redundancy.17
-rw-r--r-- 1 btschits biogeo 27871193 Apr 16 20:37 clu_redundancy.18
-rw-r--r-- 1 btschits biogeo 31416976 Apr 16 20:37 clu_redundancy.19
-rw-r--r-- 1 btschits biogeo 31695668 Apr 16 20:37 clu_redundancy.20
-rw-r--r-- 1 btschits biogeo 33818265 Apr 16 20:37 clu_redundancy.21
-rw-r--r-- 1 btschits biogeo 29237438 Apr 16 20:37 clu_redundancy.22
-rw-r--r-- 1 btschits biogeo 27935302 Apr 16 20:37 clu_redundancy.23
-rw-r--r-- 1 btschits biogeo 28823424 Apr 16 20:37 clu_redundancy.24
-rw-r--r-- 1 btschits biogeo 27534453 Apr 16 20:37 clu_redundancy.25
-rw-r--r-- 1 btschits biogeo 30455301 Apr 16 20:37 clu_redundancy.26
-rw-r--r-- 1 btschits biogeo 28701420 Apr 16 20:37 clu_redundancy.27
-rw-r--r-- 1 btschits biogeo 32188894 Apr 16 20:37 clu_redundancy.28
-rw-r--r-- 1 btschits biogeo 30423235 Apr 16 20:37 clu_redundancy.29
-rw-r--r-- 1 btschits biogeo 28317863 Apr 16 20:37 clu_redundancy.30
-rw-r--r-- 1 btschits biogeo 29122861 Apr 16 20:37 clu_redundancy.31
-rw-r--r-- 1 btschits biogeo 30789375 Apr 16 20:37 clu_redundancy.32
-rw-r--r-- 1 btschits biogeo 29195208 Apr 16 20:37 clu_redundancy.33
-rw-r--r-- 1 btschits biogeo 26808504 Apr 16 20:37 clu_redundancy.34
-rw-r--r-- 1 btschits biogeo 31027739 Apr 16 20:37 clu_redundancy.35
-rw-r--r-- 1 btschits biogeo 28576122 Apr 16 20:37 clu_redundancy.36
-rw-r--r-- 1 btschits biogeo 28892767 Apr 16 20:37 clu_redundancy.37
-rw-r--r-- 1 btschits biogeo 27212800 Apr 16 20:37 clu_redundancy.38
-rw-r--r-- 1 btschits biogeo 34081898 Apr 16 20:37 clu_redundancy.39
lrwxrwxrwx 1 btschits biogeo 61 Apr 16 20:40 input_step_redundancy -> /scratch/btschits/so245_mmseq/so245_combined_assembly_mmseqDB
-rw-r--r-- 1 btschits biogeo 4 Apr 16 20:40 input_step_redundancy.dbtype
-rw-r--r-- 1 btschits biogeo 2965266723 Apr 16 20:40 input_step_redundancy.index
lrwxrwxrwx 1 btschits biogeo 68 Apr 16 20:40 input_step_redundancy.lookup -> /scratch/btschits/so245_mmseq/so245_combined_assembly_mmseqDB.lookup
lrwxrwxrwx 1 btschits biogeo 68 Apr 16 20:40 input_step_redundancy.source -> /scratch/btschits/so245_mmseq/so245_combined_assembly_mmseqDB.source
lrwxrwxrwx 1 btschits biogeo 63 Apr 16 20:40 input_step_redundancy_h -> /scratch/btschits/so245_mmseq/so245_combined_assembly_mmseqDB_h
lrwxrwxrwx 1 btschits biogeo 70 Apr 16 20:40 input_step_redundancy_h.dbtype -> /scratch/btschits/so245_mmseq/so245_combined_assembly_mmseqDB_h.dbtype
lrwxrwxrwx 1 btschits biogeo 69 Apr 16 20:40 input_step_redundancy_h.index -> /scratch/btschits/so245_mmseq/so245_combined_assembly_mmseqDB_h.index
drwxr-xr-x 3 btschits biogeo 4 Apr 16 18:49 linclust
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.0
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.1
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.2
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.3
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.4
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.5
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.6
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.7
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.8
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.9
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.10
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.11
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.12
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.13
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.14
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.15
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.16
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.17
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.18
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.19
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.20
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.21
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.22
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.23
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.24
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.25
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.26
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.27
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.28
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.29
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.30
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.31
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.32
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.33
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.34
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.35
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.36
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.37
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.38
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.39
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.0
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.1
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.2
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.3
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.4
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.5
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.6
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.7
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.8
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.9
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.10
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.11
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.12
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.13
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.14
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.15
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.16
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.17
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.18
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.19
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.20
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.21
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.22
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.23
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.24
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.25
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.26
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.27
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.28
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.29
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.30
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.31
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.32
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.33
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.34
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.35
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.36
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.37
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.38
-rw-r--r-- 1 btschits biogeo 0 Apr 16 21:03 pref_step0.index.39

Context

I am trying to cluster a large metagenome assembly (contig nucleotide sequences) by similarity. In the past I have used cd-hit, but for this large dataset I think MMseqs2 is a more appropriate tool, particularly considering the runtime. I have already run the linclust workflow on the same assembly; it worked well and finished within a few hours. I thought I would also try the cluster workflow, since you say that it is more sensitive. The initial steps of the workflow appeared to run well and within the expected time. However, once it reached the prefilter stage, the command prompt got stuck, and it has now been stuck for over 3 days. The output files have not changed during that time either. My question is: is this 'normal' behaviour for the cluster workflow, or did I run into an issue and my job is stuck? If it is the latter, would you have any recommendations on how to solve it?

Thanks a lot for your help

milot-mirdita commented 4 years ago

That's pretty weird. How much RAM does the machine running MMseqs2 have?

bresyd commented 4 years ago

Hi,

thanks for your quick reply.

I am using a cluster with a total memory of 2 TB RAM. It looks like MMseqs2 is currently using 20% of the total memory. Here is a screenshot of 'top'

milot-mirdita commented 4 years ago

Could you attach a debugger to the process and see what it is currently doing:

gdb -p 36724

Once you are at the gdb prompt, type bt (for backtrace) and press Enter. Copy the output and paste it here. Then type quit and press Enter to leave the debugger again.
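For anyone who would rather not drive gdb interactively, the same backtrace can probably be captured in one non-interactive call. The following is only a minimal sketch: the helper names `gdb_backtrace_command` and `capture_backtrace` are made up for illustration, and it assumes gdb is on the PATH and that you are allowed to attach to the process. It relies on gdb's `-p`, `-batch` and `-ex` options and on the gdb command `thread apply all bt`, which prints a backtrace for every thread rather than only the current one.

```python
import subprocess


def gdb_backtrace_command(pid, all_threads=True):
    """Build a gdb command line that prints a backtrace and then exits.

    Hypothetical helper for illustration; not part of MMseqs2 or gdb.
    """
    bt_command = "thread apply all bt" if all_threads else "bt"
    # -p attaches to a running process, -batch exits once the commands
    # have run, -ex executes a single gdb command.
    return ["gdb", "-p", str(pid), "-batch", "-ex", bt_command]


def capture_backtrace(pid, all_threads=True):
    """Run gdb against `pid` and return everything it printed."""
    result = subprocess.run(
        gdb_backtrace_command(pid, all_threads),
        capture_output=True,
        text=True,
        check=False,  # return gdb's error text instead of raising
    )
    return result.stdout + result.stderr


# Example (36724 is the process id used in this thread):
#     print(capture_backtrace(36724))
```

Because gdb runs in batch mode it detaches on its own, so there is no prompt to quit from afterwards.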

bresyd commented 4 years ago

Thanks again. I had some issues getting gdb installed and running.

Is this the output you asked for?

Attaching to process 36724
[New LWP 36727]
[New LWP 36729]
[New LWP 36730]
[New LWP 36781]
[New LWP 36782]
[New LWP 36783]
[New LWP 36784]
[New LWP 36785]
[New LWP 36786]
[New LWP 36787]
[New LWP 36788]
[New LWP 36789]
[New LWP 36790]
[New LWP 36791]
[New LWP 36792]
[New LWP 36793]
[New LWP 36794]
[New LWP 36795]
[New LWP 36796]
[New LWP 36797]
[New LWP 36798]
[New LWP 36799]
[New LWP 36800]
[New LWP 36801]
[New LWP 36802]
[New LWP 36803]
[New LWP 36804]
[New LWP 36805]
[New LWP 36806]
[New LWP 36807]
[New LWP 36808]
[New LWP 36809]
[New LWP 36810]
[New LWP 36811]
[New LWP 36812]
[New LWP 36813]
[New LWP 36814]
[New LWP 36815]
[New LWP 36816]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--RET
0x00005623f9de6504 in CacheFriendlyOperations<512u>::hashIndexEntry(unsigned short, IndexEntryLocal*, unsigned long, CounterResult*) ()
(gdb) bt
#0  0x00005623f9de6504 in CacheFriendlyOperations<512u>::hashIndexEntry(unsigned short, IndexEntryLocal*, unsigned long, CounterResult*) ()
#1  0x00005623f9de68e5 in CacheFriendlyOperations<512u>::findDuplicates(IndexEntryLocal**, CounterResult*, unsigned long, unsigned short, unsigned short, bool) ()
#2  0x00005623f9c42be0 in QueryMatcher::match(Sequence*, float*) ()
#3  0x00005623f9c4393d in QueryMatcher::matchQuery(Sequence*, unsigned int) ()
#4  0x00005623f9c2e113 in Prefiltering::runSplit(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, bool) [clone ._omp_fn.1] ()
#5  0x00007fdb5f003cff in GOMP_parallel () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#6  0x00005623f9c337f4 in Prefiltering::runSplit(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, bool) ()
#7  0x00005623f9c352ae in Prefiltering::runSplits(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long, bool) ()
#8  0x00005623f9c2a80a in prefilter(int, char const**, Command const&) ()
#9  0x00005623f9ba5c06 in runCommand(Command*, int, char const**) ()
#10 0x00005623f9b974a5 in main ()

martin-steinegger commented 4 years ago

@bresyd only linclust and easy-linclust support nucleotide sequences, and they should also be sensitive enough to cluster down to 75% sequence identity. I would recommend the following parameters for your use case:

mmseqs linclust ../so245_combined_assembly_mmseqDB so245_combined_assembly_mmseq_clustered cluster_tmp --kmer-per-seq-scale 0.2 --cluster-mode 2 --alignment-mode 3 --cov-mode 1 -c 0.99 --min-seq-id 0.99 --max-seq-len 10000000 --cluster-reassign 1 --threads 40

I have just added --kmer-per-seq-scale 0.2
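As a side note, the index statistics in the log from the first post give a feel for how heavy the nucleotide prefilter was. The arithmetic below is a rough sketch and only one plausible reading of those numbers, not a diagnosis confirmed in this thread. It assumes that the logged "Avg k-mer size" is the number of index entries divided by the number of possible 7-mers over A, C, G and T; that assumption reproduces the logged value.

```python
# Values copied from the "Index statistics" part of the log above.
ENTRIES = 43_365_843_082      # "Entries"
KMER_SIZE = 7                 # "Index table k-mer threshold: 0 at k-mer size 7"
NUCLEOTIDES = 4               # assumption: only A, C, G, T are indexed
TOP_KMER_COUNT = 23_063_795   # count logged for the top k-mer "AAAAAAA"

possible_kmers = NUCLEOTIDES ** KMER_SIZE
avg_entries_per_kmer = ENTRIES / possible_kmers

print(f"possible 7-mers:       {possible_kmers}")
print(f"avg entries per k-mer: {avg_entries_per_kmer:.6f}")
print(f"top k-mer vs. average: {TOP_KMER_COUNT / avg_entries_per_kmer:.1f}x")
```

With only 16384 distinct 7-mers, each index bucket holds about 2.6 million entries on average, so every k-mer of every query would have to be matched against millions of candidate positions. That would be consistent with the backtrace sitting in the duplicate-counting code, although the thread itself only states that the cluster workflow does not support nucleotide input.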

bresyd commented 4 years ago

Thanks a lot for this clarification. I will stick with linclust then. I already ran linclust without the --kmer-per-seq-scale 0.2 argument a few days ago, and have now repeated it with this argument. The results appear very similar: with the --kmer-per-seq-scale argument I get two fewer representative sequences, out of a total of ~118 million representative sequences. Could you maybe briefly explain why you suggested using the --kmer-per-seq-scale argument, and what exactly it does differently compared to not using it?

Thanks again and best wishes

milot-mirdita commented 4 years ago

You set very high thresholds for coverage and sequence identity. Additional sensitivity in the clustering won't be able to combine many sequences due to the thresholds anyway.
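The cluster counts in the log from the first post illustrate this. The short calculation below is only a sketch using the three numbers printed there (the input database size and the "Number of clusters" lines after the two clust calls of the linclust redundancy step); the function name `merged_fraction` is made up for illustration.

```python
# Numbers copied from the log in the first post.
INPUT_SEQUENCES = 121_964_581   # "Database size" before clustering
AFTER_PRE_CLUST = 119_253_279   # "Number of clusters" after the first clust call
AFTER_CLUST = 119_070_883       # "Number of clusters" after the second clust call


def merged_fraction(before, after):
    """Fraction of sequences that were absorbed into another cluster."""
    return (before - after) / before


first_step = merged_fraction(INPUT_SEQUENCES, AFTER_PRE_CLUST)
second_step = merged_fraction(AFTER_PRE_CLUST, AFTER_CLUST)
overall = merged_fraction(INPUT_SEQUENCES, AFTER_CLUST)

print(f"first clust call:  {first_step:.2%}")
print(f"second clust call: {second_step:.2%}")
print(f"overall:           {overall:.2%}")
```

At 0.99 identity and 0.99 coverage, only a little over 2% of the roughly 122 million contigs were merged in that step, which fits the point that a more sensitive search has few additional pairs left to find.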

Linclust has a problem with long sequences. It selects only 20 k-mers per sequence (the default of -m), so when one long sequence is compared with another (short or long) sequence, the two can end up with no selected k-mers in common purely by chance. The longer a sequence is, the less likely it becomes that the same k-mers are picked from both.

--kmer-per-seq-scale scales the number of -m k-mers up with sequence length to avoid this pitfall. Martin enabled the parameter by default for nucleotide clustering yesterday.
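The effect can be illustrated with a toy model. The sketch below is not MMseqs2 code. It assumes a min-hash style selection (keep the k-mers with the smallest hash values), uses `zlib.crc32` as a stand-in hash, and assumes the per-sequence budget is `base + scale * length`; the exact formula and hash used by Linclust are not stated in this thread. The values k = 17 and base = 21 mirror the "Adjusted k-mer length 17" and "--kmer-per-seq 21" lines in the log. The short sequence is an exact substring of the long one, so ideally the two should share selected k-mers.

```python
import random
import zlib


def kmers(seq, k):
    """Return the set of all k-mers of `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_budget(length, base=21, scale=0.0):
    """Assumed budget: a fixed base plus `scale` k-mers per residue."""
    return base + int(scale * length)


def select_kmers(seq, k, budget):
    """Keep the `budget` k-mers with the smallest hash (min-hash style)."""
    ranked = sorted(kmers(seq, k), key=lambda km: (zlib.crc32(km.encode()), km))
    return set(ranked[:budget])


def shared_selected(a, b, k, base, scale):
    """Count selected k-mers that two sequences have in common."""
    sel_a = select_kmers(a, k, kmer_budget(len(a), base, scale))
    sel_b = select_kmers(b, k, kmer_budget(len(b), base, scale))
    return len(sel_a & sel_b)


rng = random.Random(0)
long_seq = "".join(rng.choice("ACGT") for _ in range(20000))
short_seq = long_seq[5000:5300]  # exact substring of the long sequence

fixed = shared_selected(long_seq, short_seq, k=17, base=21, scale=0.0)
scaled = shared_selected(long_seq, short_seq, k=17, base=21, scale=0.2)
print("shared selected k-mers, fixed budget: ", fixed)
print("shared selected k-mers, scaled budget:", scaled)
```

With a fixed budget, the 21 k-mers picked from the 20,000-residue sequence rarely fall inside the 300-residue region it shares with the short one. With a budget that grows with length, the long sequence keeps thousands of k-mers, so some of them land in the shared region. In this model the scaled count can never be lower than the fixed one, because each selected set only grows as the budget increases.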

bresyd commented 4 years ago

Ok, good to know. Thanks a lot for explaining and also for being so responsive to all my questions, much appreciated.

soedinglab / MMseqs2