soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

createdb max number of sequences #303

Open altaetran opened 4 years ago

altaetran commented 4 years ago

Expected Behavior

createdb creates a database with no error

Current Behavior

createdb stalls once 10962815327 sequences have been added: it stops reporting and makes no further progress. Is there a limit on the number of sequences a database can hold?

Steps to Reproduce (for bugs)

/custom_install/installations/mmseqs-nonmpi/MMseqs2/build/bin/mmseqs createdb ../2020-04-19-in.fa/part-* inDB --createdb-mode 1

MMseqs Output (for bugs)

MMseqs Version:         3863af9ac6d30f3b17620254f3a4a05b7f6b7010
Database type           0
Shuffle input database  true
Createdb mode           1
Offset of numeric ids   0
Compressed              0
Verbosity               3
Shuffle database can not be combined with --createdb-mode 0.
We recompute with --shuffle 0.
Converting sequences
[10962815327] 2h 14m 39s 271ms

martin-steinegger commented 4 years ago

Yes, 2^32-1 sequences is the current maximum for MMseqs2. We should check for this in createdb.
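
Until createdb performs such a check itself, the sequence count can be verified before the run. The following is a minimal sketch in Python, not part of MMseqs2: the helper names `count_fasta_records` and `fits_in_one_database` are hypothetical, and the sketch assumes plain, uncompressed FASTA input in which every record starts with a `>` header line.

```python
import tempfile
from pathlib import Path

# 2^32 - 1 = 4294967295, the maximum quoted in this thread.
UINT32_MAX = 2**32 - 1


def count_fasta_records(paths):
    """Count FASTA records by counting lines that start with '>'.

    Hypothetical helper; assumes uncompressed FASTA files.
    """
    total = 0
    for path in paths:
        with open(path, "rb") as handle:
            for line in handle:
                if line.startswith(b">"):
                    total += 1
    return total


def fits_in_one_database(num_sequences, limit=UINT32_MAX):
    """Return True if the sequence count does not exceed the limit."""
    return num_sequences <= limit


if __name__ == "__main__":
    # Tiny demonstration on a temporary two-record FASTA file.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "demo.fa"
        demo.write_text(">a\nMKV\n>b\nMKL\n")
        n = count_fasta_records([demo])
        print(n, fits_in_one_database(n))

    # The count at which createdb stalled in this report.
    print(fits_in_one_database(10962815327))
```

Applied to the input files of the report, a check like this would flag the 10962815327 sequences as over the limit before createdb spends hours converting them.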

altaetran commented 4 years ago

Linclust still seemed to fail with 3.6B sequences. Is the maximum 2^31-1?

milot-mirdita commented 4 years ago

Could you post the log and the machine specs? It should work up to UINT_MAX without issue. However, the RAM overhead just for holding the entries also increases linearly with the number of sequences and might become an issue at this point.
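
To make the linear growth concrete, here is a rough sketch of the arithmetic. The per-entry size is an assumption chosen purely for illustration (`HYPOTHETICAL_BYTES_PER_ENTRY` is not a figure from MMseqs2); substitute a measured value to get a meaningful estimate.

```python
# Assumption for illustration only: bytes of RAM needed per database entry.
# This is NOT an MMseqs2 figure; replace it with a measured value.
HYPOTHETICAL_BYTES_PER_ENTRY = 24

UINT32_MAX = 2**32 - 1


def entry_overhead_bytes(num_entries, bytes_per_entry):
    """Linear model: total overhead grows in proportion to the entry count."""
    return num_entries * bytes_per_entry


def to_gib(num_bytes):
    """Convert bytes to GiB (2^30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    for n in (1_000_000_000, 2_000_000_000, UINT32_MAX):
        overhead = entry_overhead_bytes(n, HYPOTHETICAL_BYTES_PER_ENTRY)
        print(f"{n:>13,d} entries -> {to_gib(overhead):8.1f} GiB")
```

Under this model, doubling the number of entries doubles the overhead, so the estimate for any database size follows directly from one measured data point.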

altaetran commented 4 years ago

There should be enough RAM, since usage never exceeds about 40%. The machine has 2 TB of memory and 160 cores. I usually get something like the following, which happens after the cluster calculation step: [attached screenshot]

martin-steinegger commented 4 years ago

@altaetran this is indeed not right. Could you please provide the whole log and command call?

altaetran commented 4 years ago

The whole log overwhelms my system, but I captured most of the output that precedes this bug:

clusterer:/mnt/cluster/filt_80_2020-04-19/combined12$ time /custom_install/installations/mmseqs-nonmpi/MMseqs2/build/bin/mmseqs linclust inDB linClu90DB tmp --min-seq-id 0.90 --kmer-per-seq 40 -c 0.9
Tmp tmp folder does not exist or is not a directory.
Create dir tmp
linclust inDB linClu90DB tmp --min-seq-id 0.90 --kmer-per-seq 40 -c 0.9

MMseqs Version:                         290668474611312a94a868bf041b38c8618f5ef6
Cluster mode                            0
Max connected component depth           1000
Similarity type                         2
Threads                                 160
Compressed                              0
Verbosity                               3
Substitution matrix                     nucl:nucleotide.out,aa:blosum62.out
Add backtrace                           false
Alignment mode                          2
Allow wrapped scoring                   false
E-value threshold                       0.001
Seq. id. threshold                      0.9
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0.9
Coverage mode                           0
Max sequence length                     65535
Compositional bias                      1
Realign hits                            false
Max reject                              2147483647
Max accept                              2147483647
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          1
Pseudo count b                          1.5
Score bias                              0
Gap open cost                           11
Gap extension cost                      1
Zdrop                                   40
Alphabet size                           nucl:5,aa:21
k-mers per sequence                     40
Spaced k-mers                           0
Spaced k-mer pattern
Scale k-mers per sequence               nucl:0.200,aa:0.000
Adjust k-mer length                     false
Mask residues                           0
Mask lower case residues                0
k-mer length                            0
Shift hash                              67
Split memory limit                      0
Include only extendable                 false
Skip repeating k-mers                   false
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Remove temporary files                  false
Force restart with latest tmp           false
MPI runner

Set cluster mode SET COVER.
kmermatcher inDB tmp/18375844090983922724/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.9 --kmer-per-seq 40 --spaced-kmer-mode 0 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.9 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 160 --compressed 0 -v 3

kmermatcher inDB tmp/18375844090983922724/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.9 --kmer-per-seq 40 --spaced-kmer-mode 0 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.9 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 160 --compressed 0 -v 3

Database size: 2830651961 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)

Generate k-mers list for 1 split
[=================================================================] 100.00% 2.83B 2h 28m 50s 478ms

Sort kmer 1h 11m 10s 680ms
Time for fill: 0h 14m 35s 645ms
Time for merging to pref: 0h 31m 17s 620ms
Time for processing: 5h 43m 57s 88ms
rescorediagonal inDB inDB tmp/18375844090983922724/pref tmp/18375844090983922724/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.9 -a 0 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 160 --compressed 0 -v 3

[=================================================================] 100.00% 2.83B 1h 11m 50s 841ms

[=================================================================] 
.
.
.
100.00% 1.00M 0s 440ms
[=================================================================] 100.00% 1.00M 0s 541ms
[=================================================================] 100.00% 1.00M 0s 501ms
[=================================================================] 100.00% 1.00M 0s 455ms
[=================================================================] 100.00% 1.00M 0s 609ms
[=================================================================] 100.00% 1.00M 0s 536ms
[=================================================================] 100.00% 1.00M 0s 626ms
[=================================================================] 100.00% 1.00M 0s 591ms
[=================================================================] 100.00% 1.00M 0s 572ms
[=================================================================] 100.00% 651.96K 0s 344ms
Sort entries
Find missing connections

altaetran commented 4 years ago

I saw there was a potential fix for the database size limit on GitHub. Was anyone able to take a look at this issue? I'm excited to try MMseqs2 out on very large databases! Thanks!

milot-mirdita commented 3 years ago

Sorry, your issue fell through the cracks. Did you try again? We have clustered databases that were just under the INT_MAX limit before without issues. If you can give us any more details about what might have gone wrong, we can look into it. Right now I have no idea where to start.
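
For orientation, the sequence counts mentioned in this thread can be placed relative to the two limits under discussion. This is a small sketch of the arithmetic only, assuming the usual 32-bit `int` and `unsigned int`; it does not show where in MMseqs2 the failure occurs.

```python
INT_MAX = 2**31 - 1    # 2147483647, assuming a 32-bit signed int
UINT_MAX = 2**32 - 1   # 4294967295, assuming a 32-bit unsigned int


def classify(count):
    """Place a sequence count relative to the 32-bit limits."""
    if count <= INT_MAX:
        return "within INT_MAX"
    if count <= UINT_MAX:
        return "above INT_MAX, within UINT_MAX"
    return "above UINT_MAX"


# Counts taken from the reports in this thread.
counts = {
    "createdb progress counter at stall": 10962815327,
    "linclust 'Database size' in the log": 2830651961,
    "linclust run reported as 3.6B": 3_600_000_000,
}

if __name__ == "__main__":
    for label, count in counts.items():
        print(f"{label}: {count} -> {classify(count)}")
```

Both reported linclust runs lie between the two limits, which is the range not covered by a run that stayed just under INT_MAX.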

durrantmm commented 3 years ago

Following up on this: I'm running into the same issue. Are there any plans to increase the limit?