Open altaetran opened 4 years ago
Yes, 2^32-1 is the current maximum of MMseqs2. We should check this in createdb.
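To make the limit concrete, the sketch below compares the sequence counts quoted in this thread against the 32-bit maxima. It is plain integer arithmetic, not MMseqs2 code, and the helper name `fits_uint32` is made up for illustration.

```python
UINT32_MAX = 2**32 - 1  # 4294967295, the maximum quoted above
INT32_MAX = 2**31 - 1   # 2147483647

def fits_uint32(n_sequences: int) -> bool:
    """Hypothetical helper: True if the count fits in an unsigned 32-bit integer."""
    return 0 <= n_sequences <= UINT32_MAX

# Counts taken from this thread.
createdb_count = 10962815327  # where createdb stalled
linclust_count = 2830651961   # "Database size" reported in the linclust log

for label, n in [("createdb", createdb_count), ("linclust", linclust_count)]:
    print(label, n, "fits in uint32:", fits_uint32(n), "exceeds int32:", n > INT32_MAX)
```

The createdb input is more than twice the unsigned limit, while the linclust database sits between the signed and unsigned limits.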
Linclust seemed to fail still with 3.6B sequences. Is the max 2^31-1?
Could you post the log and the machine specs? It should work up to UINT_MAX without issue. However, the RAM overhead just for holding the entries also increases linearly and might become an issue at this point.
There should be enough RAM, since the usage never tops 40% or so. It is a 2TB memory machine with 160 cores. Usually I get something like this, which happens after the cluster calculation step.
@altaetran this is indeed not right. Could you please provide the whole log and command call?
The whole log overwhelms my system, but I captured most of the output that precedes the bug:
clusterer:/mnt/cluster/filt_80_2020-04-19/combined12$ time /custom_install/installations/mmseqs-nonmpi/MMseqs2/build/bin/mmseqs linclust inDB linClu90DB tmp --min-seq-id 0.90 --kmer-per-seq 40 -c 0.9
Tmp tmp folder does not exist or is not a directory.
Create dir tmp
linclust inDB linClu90DB tmp --min-seq-id 0.90 --kmer-per-seq 40 -c 0.9
MMseqs Version: 290668474611312a94a868bf041b38c8618f5ef6
Cluster mode 0
Max connected component depth 1000
Similarity type 2
Threads 160
Compressed 0
Verbosity 3
Substitution matrix nucl:nucleotide.out,aa:blosum62.out
Add backtrace false
Alignment mode 2
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0.9
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0.9
Coverage mode 0
Max sequence length 65535
Compositional bias 1
Realign hits false
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Score bias 0
Gap open cost 11
Gap extension cost 1
Zdrop 40
Alphabet size nucl:5,aa:21
k-mers per sequence 40
Spaced k-mers 0
Spaced k-mer pattern
Scale k-mers per sequence nucl:0.200,aa:0.000
Adjust k-mer length false
Mask residues 0
Mask lower case residues 0
k-mer length 0
Shift hash 67
Split memory limit 0
Include only extendable false
Skip repeating k-mers false
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Remove temporary files false
Force restart with latest tmp false
MPI runner
Set cluster mode SET COVER.
kmermatcher inDB tmp/18375844090983922724/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.9 --kmer-per-seq 40 --spaced-kmer-mode 0 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.9 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 160 --compressed 0 -v 3
kmermatcher inDB tmp/18375844090983922724/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.9 --kmer-per-seq 40 --spaced-kmer-mode 0 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.9 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 160 --compressed 0 -v 3
Database size: 2830651961 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)
Generate k-mers list for 1 split
[=================================================================] 100.00% 2.83B 2h 28m 50s 478ms
Sort kmer 1h 11m 10s 680ms
Time for fill: 0h 14m 35s 645ms
Time for merging to pref: 0h 31m 17s 620ms
Time for processing: 5h 43m 57s 88ms
rescorediagonal inDB inDB tmp/18375844090983922724/pref tmp/18375844090983922724/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.9 -a 0 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 160 --compressed 0 -v 3
[=================================================================] 100.00% 2.83B 1h 11m 50s 841ms
[=================================================================]
.
.
.
100.00% 1.00M 0s 440ms
[=================================================================] 100.00% 1.00M 0s 541ms
[=================================================================] 100.00% 1.00M 0s 501ms
[=================================================================] 100.00% 1.00M 0s 455ms
[=================================================================] 100.00% 1.00M 0s 609ms
[=================================================================] 100.00% 1.00M 0s 536ms
[=================================================================] 100.00% 1.00M 0s 626ms
[=================================================================] 100.00% 1.00M 0s 591ms
[=================================================================] 100.00% 1.00M 0s 572ms
[=================================================================] 100.00% 651.96K 0s 344ms
Sort entries
Find missing connections
I saw there was a potential fix regarding the database size limit on GitHub. Was anyone able to take a look at this issue? I'm excited to try MMseqs2 out on very large databases! Thanks!
Sorry, your issue fell through the cracks. Did you try again? We have clustered datasets that were just under the INT_MAX limit before without issues. If you can give us any more details about what might have gone wrong, we can look into it. Right now I have no idea where to start.
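For reference, the database in the log above (2830651961 entries) is larger than INT_MAX but smaller than UINT_MAX. The sketch below shows what that count would look like if some code path held it in a signed 32-bit integer. Whether MMseqs2 does this anywhere is an assumption that this thread does not confirm; the helper `as_int32` is hypothetical and only emulates two's-complement wraparound.

```python
def as_int32(n: int) -> int:
    """Hypothetical helper: reinterpret n as a signed 32-bit value (two's complement)."""
    return (n + 2**31) % 2**32 - 2**31

db_size = 2830651961  # "Database size" from the linclust log

# A value above INT_MAX wraps around to a negative number.
print("as signed 32-bit:", as_int32(db_size))
```

A negative entry count or index would be one plausible way for a late step to fail silently, which is why the exact failing step and its log output matter for narrowing this down.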
Following up on this: I am running into the same issue. Are there any plans to increase the limit?
Expected Behavior
createdb creates a database with no error
Current Behavior
createdb stalls and stops reporting once 10962815327 sequences have been added and fails to make progress. Is there a limit to the number of sequences that can be in a database?
Steps to Reproduce (for bugs)
/custom_install/installations/mmseqs-nonmpi/MMseqs2/build/bin/mmseqs createdb ../2020-04-19-in.fa/part-* inDB --createdb-mode 1
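Since the number of sequences matters here, it can help to count the FASTA records before calling createdb. This is a minimal sketch, assuming the inputs are uncompressed FASTA files in which every record starts with a `>` header line; `count_fasta_records` is a hypothetical helper, not part of MMseqs2.

```python
import glob

UINT32_MAX = 2**32 - 1  # 4294967295

def count_fasta_records(pattern: str) -> int:
    """Hypothetical helper: count '>' header lines in all files matching a glob pattern."""
    total = 0
    for path in sorted(glob.glob(pattern)):
        # Read in binary mode to avoid decoding overhead on very large files.
        with open(path, "rb") as handle:
            total += sum(1 for line in handle if line.startswith(b">"))
    return total
```

For the command above, one would call `count_fasta_records("../2020-04-19-in.fa/part-*")` and compare the result against `UINT32_MAX` before building the database.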
MMseqs Output (for bugs)