soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

easy-linclust got stuck when clustering SRC #324

Open · arglog opened this issue 4 years ago

arglog commented 4 years ago

Summary: Running easy-linclust on SRC got stuck after the first call of rescorediagonal: no progress and no printed output for ~12 h. Not sure if it's related to #323, but since the behavior is different I'm opening a new issue.

Expected Behavior

The run finishes normally.

Current Behavior

The run got stuck after the first call of rescorediagonal: no progress and no printed output for ~12 h.

Steps to Reproduce (for bugs)

> wget http://gwdu111.gwdg.de/~compbiol/plass/2018_08/SRC.fasta.gz
> gunzip -k SRC.fasta.gz
> mmseqs easy-linclust SRC.fasta test/SRC-50 /export/scratch/SRC-50 -c 0.9 --cov-mode 1 --cluster-mode 2 --min-seq-id 0.5 --split-memory-limit 500G

MMseqs Output (for bugs)

easy-linclust SRC.fasta test/SRC-50 /export/scratch/SRC-50 -c 0.9 --cov-mode 1 --cluster-mode 2 --min-seq-id 0.5 --split-memory-limit 500G

MMseqs Version:                         cab0e83840f5afa0632aada56e6bacaf46211c33
Cluster mode                            2
Max connected component depth           1000
Similarity type                         2
Threads                                 96
Compressed                              0
Verbosity                               3
Substitution matrix                     nucl:nucleotide.out,aa:blosum62.out
Add backtrace                           false
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       0.001
Seq. id. threshold                      0.5
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0.9
Coverage mode                           1
Max sequence length                     65535
Compositional bias                      1
Realign hits                            false
Max reject                              2147483647
Max accept                              2147483647
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          1
Pseudo count b                          1.5
Score bias                              0
Gap open cost                           nucl:5,aa:11
Gap extension cost                      nucl:2,aa:1
Zdrop                                   40
Alphabet size                           nucl:5,aa:21
k-mers per sequence                     21
Spaced k-mers                           0
Spaced k-mer pattern
Scale k-mers per sequence               nucl:0.200,aa:0.000
Adjust k-mer length                     false
Mask residues                           1
Mask lower case residues                0
k-mer length                            0
Shift hash                              67
Split memory limit                      500G
Include only extendable                 false
Skip repeating k-mers                   false
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Remove temporary files                  true
Force restart with latest tmp           false
MPI runner
Database type                           0
Shuffle input database                  true
Createdb mode                           1
Write lookup file                       0
Offset of numeric ids                   0

createdb SRC.fasta /export/scratch/SRC-50/8871099322051866948/input --dbtype 0 --shuffle 1 --createdb-mode 1 --write-lookup 0 --id-offset 0 --compressed 0 -v 3

Shuffle database cannot be combined with --createdb-mode 0
We recompute with --shuffle 0
Converting sequences
[2022891328] 19m 44s 787ms
Time for merging to input_h: 0h 15m 26s 958ms
Time for merging to input: 0h 15m 5s 407ms
Database type: Aminoacid
Time for processing: 0h 51m 25s 878ms
Tmp /export/scratch/SRC-50/8871099322051866948/clu_tmp folder does not exist or is not a directory.
Create dir /export/scratch/SRC-50/8871099322051866948/clu_tmp
linclust /export/scratch/SRC-50/8871099322051866948/input /export/scratch/SRC-50/8871099322051866948/clu /export/scratch/SRC-50/8871099322051866948/clu_tmp --cluster-mode 2 -e 0.001 --min-seq-id 0.5 -c 0.9 --cov-mode 1 --spaced-kmer-mode 0 --split-memory-limit 500G --remove-tmp-files 1

kmermatcher /export/scratch/SRC-50/8871099322051866948/input /export/scratch/SRC-50/8871099322051866948/clu_tmp/15588367470074044035/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.5 --kmer-per-seq 21 --spaced-kmer-mode 0 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 1 -k 0 -c 0.9 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 500G --include-only-extendable 0 --ignore-multi-kmer 0 --threads 96 --compressed 0 -v 3

Database size: 2022891389 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)

Not enough memory to process at once need to split
[=================================================================] 100.00% 2.02B 18m 29s 316ms
Process file into 2 parts
Generate k-mers list for 1 split
[=================================================================] 100.00% 2.02B 11m 22s 53ms
Sort kmer 0h 17m 18s 696ms
Sort by rep. sequence 0h 8m 48s 22ms
Generate k-mers list for 2 split
[=================================================================] 100.00% 2.02B 14m 32s 166ms
Sort kmer 0h 6m 35s 100ms
Sort by rep. sequence 0h 2m 51s 246ms
Merge splits ... Time for fill: 2h 18m 33s 262ms
Time for merging to pref: 0h 25m 57s 283ms
Time for processing: 4h 41m 10s 259ms
rescorediagonal /export/scratch/SRC-50/8871099322051866948/input /export/scratch/SRC-50/8871099322051866948/input /export/scratch/SRC-50/8871099322051866948/clu_tmp/15588367470074044035/pref /export/scratch/SRC-50/8871099322051866948/clu_tmp/15588367470074044035/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.9 -a 0 --cov-mode 1 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 96 --compressed 0 -v 3

^^^^^^ There is no further output after the last line above, and the process has been stuck for more than 12 h.
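As a side note, the kmermatcher log above prints the reduced 13-letter amino acid alphabet it clusters k-mers in. The mapping can be sketched as a small shell function (the groups are taken from the log output; this is an illustration, not MMseqs2's internal implementation):

```shell
# Map each residue to the first letter of its group, per the log line:
# (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)
reduce_aa() {
  echo "$1" | tr 'STBNQZYVRJM' 'AADDEEFIKLL'
}
```

For example, `reduce_aa LJMW` prints `LLLW`, since L, J, and M fall into the same group.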


milot-mirdita commented 4 years ago

That's most likely a different issue. We have a problem when we estimate RAM usage incorrectly; when that happens, performance usually tanks hard. Martin recently made some improvements to mitigate this, but apparently it's still a problem.
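A quick way to sanity-check whether the input database could plausibly fit in the configured memory budget is to compare the database file size against the limit. This is only a rough illustrative diagnostic, not how MMseqs2 estimates its RAM usage:

```shell
# usage: fits_in_ram <db_file> <limit_in_bytes>
# Returns success (0) if the file is no larger than the memory budget.
fits_in_ram() {
  local sz
  sz=$(stat -c %s "$1" 2>/dev/null || stat -f %z "$1")  # GNU stat, BSD fallback
  [ "$sz" -le "$2" ]
}
```

For the SRC run above, the input database would be compared against the 500G `--split-memory-limit`; note the actual working set also includes index and intermediate files, so this check is optimistic.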

What are the specs of the system you are running this clustering on?

arglog commented 4 years ago
milot-mirdita commented 4 years ago

I think the sequence database is just a bit too large to fit into RAM. You could try the --compressed 1 parameter to compress each sequence (and all intermediate databases). You will pay a small runtime cost for the constant decompression, but that should be more than offset, since the sequences will no longer be constantly evicted from the OS file cache.
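Concretely, that would mean rerunning the original command from this thread with compression enabled (same flags as before; only --compressed 1 is added):

```shell
mmseqs easy-linclust SRC.fasta test/SRC-50 /export/scratch/SRC-50 \
  -c 0.9 --cov-mode 1 --cluster-mode 2 --min-seq-id 0.5 \
  --split-memory-limit 500G --compressed 1
```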

Dealing with billions of sequences is still somewhat awkward and difficult. We have to improve memory management for these cases.