soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Core dumping during "createindex" #3

Zaphod-dev closed this issue 8 years ago

Zaphod-dev commented 8 years ago

Hello, I'm eager to try mmseqs but have been unable to build the database for UniRef90, as mmseqs segfaults and dumps core during the "createindex" phase. I tried both the precompiled mmseqs and my own compiled version, without --split or --threads as well as with various combinations of both. My machine has 128 GB of RAM (and the same amount of swap space) and 6 TB of free disk space. The process dies before ever using more than 40% of the RAM. The output is:

mmseqs createdb  uniref90/uniref90.fasta uniref90.mms

ls -ltr uniref90/
-rw-r--r-- 1 hingamp.p MIO 19965315337 sept. 28 22:51 uniref90.fasta
-rw-r--r-- 1 hingamp.p MIO  1244574423 sept. 29 00:07 uniref90.mms.lookup
-rw-r--r-- 1 hingamp.p MIO  4652608129 sept. 29 00:07 uniref90.mms_h
-rw-r--r-- 1 hingamp.p MIO  1025056829 sept. 29 00:09 uniref90.mms_h.index
-rw-r--r-- 1 hingamp.p MIO 15172645206 sept. 29 00:09 uniref90.mms
-rw-r--r-- 1 hingamp.p MIO  1063262010 sept. 29 00:11 uniref90.mms.index

mmseqs createindex uniref90/uniref90.mms --split 10 --threads 20
Program call:
uniref90/uniref90.mms --split 10 --threads 20 

MMseqs Version:         ab6d7b3105611a0860c801603997f1721785916a
Sub Matrix              blosum62.out
K-mer size              0
Alphabet size           21
max. sequence length    32000
Split DB                10
spaced Kmer             1
Threads                 20
Verbosity               3

Substitution matrices...
Index table: counting k-mers...
.WARNING: Sequence (dbKey=10870) contains only ATGC. It might be a nucleotide sequence.
..................................................................................................  1 Mio. sequences processed
................................................................................................... 2 Mio. sequences processed
................................................................................................... 3 Mio. sequences processed
...............................................................................
Index table: Masked residues: 26370434
Index table: fill...
................................................................................................... 1 Mio. sequences processed
................................................................................................... 2 Mio. sequences processed
................................................................................................... 3 Mio. sequences processed
...............................................................................
Index table: removing duplicate entries...
Index table init done.

Write 10
Write 20
Write 60
Write 70
Write 80
Write 30
Write 40
Index table: counting k-mers...
................................................................................................... 1 Mio. sequences processed
...........................................................................................WARNING: Sequence (dbKey=5712154) contains only ATGC. It might be a nucleotide sequence.
........    2 Mio. sequences processed
...........................................................WARNING: Sequence (dbKey=6387662) contains only ATGC. It might be a nucleotide sequence.
........................................    3 Mio. sequences processed
...................................................................
Index table: Masked residues: 47802947
Index table: fill...
Segmentation fault (core dumped)

Many thanks for any help or advice. I have watched the mmseqs demo on https://www.youtube.com/watch?v=LqiHyCLjPno and am looking forward to enjoying the huge simplification it promises (the last example in the demo is the 2bLCA we applied to metagenomics data and my workflow was much more complex and slow than with mmseqs!)... Best, Pascal

martin-steinegger commented 8 years ago

Thanks for reporting this bug, Pascal. Is this the current UniRef90 release? Can you tell me what kind of CPU and Linux you are using? You can gather this information by calling cat /proc/cpuinfo and uname -a.

Could you try running MMseqs2 by just calling mmseqs search without building the index? In this case MMseqs2 decides automatically how to split the database.

milot-mirdita commented 8 years ago

I tried to build an index too, and it looks like we have a bug with the --split parameter. Can you run it without that parameter? It should work fine and still fit into memory: (7 B * 340 * 45000000) + (8 * 21^7) B ≈ 122 GB.

Edit: Never mind, you already tried it without the --split parameter. Martin will investigate the buggy --split soon.
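
The memory estimate quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic under one reading of the formula: 7 bytes per index entry, an average sequence length of 340, about 45 million sequences, and 8 bytes per k-mer slot for an alphabet of 21 and k = 7. The function and parameter names are hypothetical and not part of MMseqs2.

```python
def estimate_index_bytes(num_seqs, avg_len, kmer_size=7, alphabet_size=21,
                         bytes_per_entry=7, bytes_per_slot=8):
    """Rough index size in bytes, following the formula quoted in the comment."""
    # Assumed reading: one entry per residue position across all sequences.
    entries = bytes_per_entry * avg_len * num_seqs
    # Assumed reading: one slot per possible k-mer (alphabet_size ** kmer_size).
    table = bytes_per_slot * alphabet_size ** kmer_size
    return entries + table


if __name__ == "__main__":
    total = estimate_index_bytes(num_seqs=45_000_000, avg_len=340)
    print(f"{total / 1e9:.1f} GB")
```

With these inputs the estimate comes out at about 121.5 GB, in line with the ≈ 122 GB quoted, and close to the 128 GB of RAM on the machine in question.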

Zaphod-dev commented 8 years ago

Thanks for the fast feedback! Here is the requested information about our architecture. I'll try the two suggestions (no indexing, and indexing without --split) as soon as our machine is free again (a big cd-hit job is swapping like crazy)...

$ uname -a 

Linux bioinfo 4.4.0-38-generic #57-Ubuntu SMP Tue Sep 6 15:42:33 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ cat /proc/cpuinfo

processor   : 23
vendor_id   : GenuineIntel
cpu family  : 6
model       : 62
model name  : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
stepping    : 4
microcode   : 0x416
cpu MHz     : 1228.906
cache size  : 15360 KB
physical id : 1
siblings    : 12
core id     : 5
cpu cores   : 6
apicid      : 43
initial apicid  : 43
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
bugs        :
bogomips    : 5189.25
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

milot-mirdita commented 8 years ago

Short update: Martin should have fixed the issue with the --split parameter. MMseqs should now choose a sensible value for split automatically, so you shouldn't need it at all.

Could you check whether this fixes your issue?

Zaphod-dev commented 8 years ago

Hello, I tried indexing without the --split parameter and this did indeed solve the issue! However, I have now run into a new issue during the search (I'm using a version compiled locally from source cloned from git this morning). There is a floating point exception in tmp/blastp.sh at line 60, which I guess is why the files tmp/pref_4 and tmp/aln_4 are missing:

$mmseqs search tara_test.faa uniref90/uniref90.mms test1_uniref90 tmp --threads 24 -a > mmseqs_search_tara_1.out
tmp/blastp.sh: line 60: 31911 Floating point exception   (core dumped) $RUNNER $MMSEQS prefilter "$INPUT" "$TARGET_DB_PREF" "$TMP_PATH/pref_$SENS" $PREFILTER_PAR -s $SENS
Could not open data file /home/hingamp.p/tmp/pref_4!
mv: cannot stat '/home/hingamp.p/tmp/aln_4': No such file or directory

$more mmseqs_search_tara_1.out
Program call:
tara_test.faa uniref90/uniref90.mms test1_uniref90 tmp --threads 24 -a 

MMseqs Version:                     e3588acbec735d8aa3158f7bdf38870bebc7d6df
Sub Matrix                          blosum62.out
Add backtrace                       true
Alignment mode                      0
E-value threshold                   0.001
Seq. Id Threshold                   0
Coverage threshold                  0
Max. sequence length                32000
Max. results per query              300
Compositional bias                  1
Profile                             false
Realign hit                         false
Max Reject                          2147483647
Detect fragments                    false
Include identical Seq. Id.          false
Threads                             24
Verbosity                           3
Sensitivity                         4
K-mer size                          0
K-score                             2147483647
Alphabet size                       21
Offset result                       0
Split DB                            0
Split mode                          2
Diagonal Scoring                    1
Minimum Diagonal score              30
Spaced Kmer                         1
Profile e-value threshold           0.001
Use global sequence weighting       false
Maximum sequence identity threshold 0.9
Minimum seq. id.                    0
Minimum score per column            -20
Minimum coverage                    0
Select n most diverse seqs          100
Pseudo count a                      1
Pseudo count b                      1.5
Number search iterations            1
Start sensitivity                   4
sensitivity step size               1
Sets the MPI runner                 

/home/hingamp.p
/home/hingamp.p
Program call:
tara_test.faa uniref90/uniref90.mms /home/hingamp.p/tmp/pref_4 --sub-mat blosum62.out -k 0 --k-score 2147483647 --alph-size 21 --max-seq-len 32000 --max-seqs 300 --offset-result 0 --split 0 --split-mode 2 --comp-bias-corr 1 --diag-score 1 --min-ungapped-score 30 --spaced-kmer-mode 1 --threads 24 -v 3 -s 4

MMseqs Version:             e3588acbec735d8aa3158f7bdf38870bebc7d6df
Sub Matrix                  blosum62.out
Sensitivity                 4
K-mer size                  0
K-score                     2147483647
Alphabet size               21
Max. sequence length        32000
Profile                     false
Max. results per query      300
Offset result               0
Split DB                    0
Split mode                  2
Compositional bias          1
Diagonal Scoring            1
Minimum Diagonal score      30
Include identical Seq. Id.  false
Spaced Kmer                 1
Threads                     24
Verbosity                   3

MPI Init...
Rank: 0 Size: 1
Initialising data structures...
Using 24 threads.

Query database: tara_test.faa(size=0)
Target database: uniref90/uniref90.mms(size=44448995)
Use kmer size 7 and split 4 using split mode 0
Needed memory (55204215885 byte) of total memory (135146213376 byte)
Substitution matrices...
Time for init: 0 h 0 m 28s

Process prefiltering step 0 of 4

Index table: counting k-mers...
.WARNING: Sequence (dbKey=10870) contains only ATGC. It might be a nucleotide sequence.
..................................................................................................  1 Mio. sequences processed
................................................................................................... 2 Mio. sequences processed
................................................................................................... 3 Mio. sequences processed
................................................................................................... 4 Mio. sequences processed
................................................................................................... 5 Mio. sequences processed
.......................................................................WARNING: Sequence (dbKey=5712154) contains only ATGC. It might be a nucleotide sequence.
............................    6 Mio. sequences processed
......................................WARNING: Sequence (dbKey=6387662) contains only ATGC. It might be a nucleotide sequence.
.............................................................   7 Mio. sequences processed
................................................................................................... 8 Mio. sequences processed
................................................................................................... 9 Mio. sequences processed
..................................
Index table: Masked residues: 99295754
Index table: fill...
................................................................................................... 1 Mio. sequences processed
................................................................................................... 2 Mio. sequences processed
................................................................................................... 3 Mio. sequences processed
................................................................................................... 4 Mio. sequences processed
................................................................................................... 5 Mio. sequences processed
................................................................................................... 6 Mio. sequences processed
................................................................................................... 7 Mio. sequences processed
................................................................................................... 8 Mio. sequences processed
................................................................................................... 9 Mio. sequences processed
..................................
Index table: removing duplicate entries...
Index table init done.

DB statistic
Entries:         3630865490
DB Size:         36193901268 (byte)
Avg Kmer Size:   2.01593
Top 10 Kmers
    XXXXXXL     328142
    XXXXXXP     320052
    PXXXXXX     301922
    LXXXXXX     241818
    RXXXXXX     240178
    SXXXXXX     235598
    TXXXXXX     219310
    KXXXXXX     203132
    QXXXXXX     188408
    VXXXXXX     176435
Min Kmer Size:   0
Empty list: 1073051111

Time for index table init: 0 h 5 m 36s

k-mer similarity threshold: 115
k-mer match probability: 0

Starting prefiltering scores calculation (step 0 of 4)
Query db start  0 to 0
Target db start  0 to 9343283
Program call:
tara_test.faa uniref90/uniref90.mms /home/hingamp.p/tmp/pref_4 /home/hingamp.p/tmp/aln_4 --sub-mat blosum62.out -a --alignment-mode 0 -e 0.001 --min-seq-id 0 -c 0 --max-seq-len 32000 --max-seqs 300 --comp-bias-corr 1 --max-rejected 2147483647 --threads 24 -v 3

MMseqs Version:             e3588acbec735d8aa3158f7bdf38870bebc7d6df
Sub Matrix                  blosum62.out
Add backtrace               true
Alignment mode              0
E-value threshold           0.001
Seq. Id Threshold           0
Coverage threshold          0
Max. sequence length        32000
Max. results per query      300
Compositional bias          1
Profile                     false
Realign hit                 false
Max Reject                  2147483647
Detect fragments            false
Include identical Seq. Id.  false
Threads                     24
Verbosity                   3

MPI Init...
Rank: 0 Size: 1
Init data structures...
Compute score, coverage and sequence id.
Using 24 threads.

martin-steinegger commented 8 years ago

Did you intend to compile with MPI? Could you please send me your cmake call and output? If you do not want to use MPI, you can specify cmake -DHAVE_MPI=0 -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=. ..

The MPI-compiled version cannot be used as a standalone binary. You have to specify an MPI runner and the number of nodes you want to run your code on, e.g. RUNNER="mpirun -np 42" mmseqs search queryDB targetDB resultDB tmp.

I also see that your query set has size 0: Query database: tara_test.faa(size=0). How did you build this query set? Could you please send us your workflow?

Zaphod-dev commented 8 years ago

I'm sorry for the delay in replying, due to concurrent teaching duties... I fixed my two mistakes (I had specified the FASTA file instead of the query DB, and I recompiled without MPI, which I probably don't really need) and the search runs perfectly! First tests suggest that mmseqs2 sensitivity is indeed far greater than that of ghostx or Rapsearch2, with equivalent or even better speed (I have to run more tests to measure speed on our cluster, whose nodes have less RAM than the server I used to build the UniRef90 DB and run the first searches). Many thanks again for your most helpful advice, I'm mighty happy to have pursued this through :)
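
The first of the two mistakes (passing the FASTA file where a query DB is expected) can be caught before a search is started. The sketch below assumes, based on the file listing earlier in this thread, that a database written by createdb has a companion .index file next to it, and that a FASTA file starts with a '>' header line. The helper names are hypothetical and not part of MMseqs2.

```python
import tempfile
from pathlib import Path


def has_index_companion(path):
    """True if both `path` and `path + '.index'` exist (as for uniref90.mms)."""
    p = Path(path)
    return p.is_file() and Path(str(p) + ".index").is_file()


def looks_like_fasta(path):
    """True if the first non-blank line of the file starts with '>'."""
    with open(path, "rb") as fh:
        for line in fh:
            stripped = line.strip()
            if stripped:
                return stripped.startswith(b">")
    return False  # empty file


def check_query(path):
    """Return a short diagnosis string for a query path."""
    if has_index_companion(path):
        return "ok: looks like a sequence DB"
    if Path(path).is_file() and looks_like_fasta(path):
        return "fasta: run createdb first"
    return "unknown: neither a DB nor a FASTA file"


if __name__ == "__main__":
    # Small demo with a throwaway FASTA file (made-up content).
    with tempfile.TemporaryDirectory() as d:
        fasta = Path(d) / "query.faa"
        fasta.write_text(">seq1\nMKVL\n")
        print(check_query(fasta))
```

Run against a plain FASTA file such as the one in the report, this kind of check would flag the input as needing createdb first, which is consistent with the size=0 reported for the query database in the log.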

Zaphod-dev commented 8 years ago

I have run into a new issue: having tested the UniRef90 indexing and searching on a 132GB RAM server, I then tested the search on a 64GB RAM node typical of our cluster. My first attempt (naive, in retrospect) was to use the UniRef90 index created (without any --split option) on the 132GB server. This caused a crash after a useful warning, "MMseqs processes needs more main memory than available.Increase the size of --split", as seen below in the first output.

I therefore attempted to index the database directly on the 64GB RAM node, either with no explicit --split option or by specifying --split 6 (an explicit split is useful because some of the cluster nodes have less than the 64GB RAM of the machine I'm indexing on). But the indexing with --split 6 fails with the second output provided below, and it also fails without the --split option (see third output below).

All tests have been carried out using a compiled mmseqs cloned from a fresh git clone. The work directory has 76TB of free space.

I'm really at a loss as to what the "Could not write to data file /" error message might indicate.

-bash-4.2$ mmseqs search subseq_Mms.1 uniref90.mms toto.mms tmp/ --max-seqs 100000 --threads 8 -a true -e 1E-10 -s 1 
Program call:
subseq_Mms.1 uniref90.mms toto.mms tmp/ --max-seqs 100000 --threads 8 -a true -e 1E-10 -s 1 

MMseqs Version:                     c5615b34c686b1cf0f200670be8bc6cba76d90f9
Sub Matrix                          blosum62.out
Add backtrace                       true
Alignment mode                      0
E-value threshold                   1e-10
Seq. Id Threshold                   0
Coverage threshold                  0
Max. sequence length                32000
Max. results per query              100000
Compositional bias                  1
Profile                             false
Realign hit                         false
Max Reject                          2147483647
Detect fragments                    false
Include identical Seq. Id.          false
Threads                             8
Verbosity                           3
Sensitivity                         1
K-mer size                          0
K-score                             2147483647
Alphabet size                       21
Offset result                       0
Split DB                            0
Split mode                          2
Diagonal Scoring                    1
Minimum Diagonal score              15
Spaced Kmer                         1
Profile e-value threshold           0.001
Use global sequence weighting       false
Maximum sequence identity threshold 0.9
Minimum seq. id.                    0
Minimum score per column            -20
Minimum coverage                    0
Select n most diverse seqs          100
Pseudo count a                      1
Pseudo count b                      1.5
Number search iterations            1
Start sensitivity                   4
sensitivity step size               1
Sets the MPI runner                 

/home/hingamp.p
/home/hingamp.p
Program call:
subseq_Mms.1 uniref90.mms tmp/pref_1 --sub-mat blosum62.out -k 0 --k-score 2147483647 --alph-size 21 --max-seq-len 32000 --max-seqs 100000 --offset-result 0 --split 0 --split-mode 2 --comp-bias-corr 1 --diag-score 1 --min-ungapped-score 15 --spaced-kmer-mode 1 --threads 8 -v 3 -s 1 

MMseqs Version:             c5615b34c686b1cf0f200670be8bc6cba76d90f9
Sub Matrix                  blosum62.out
Sensitivity                 1
K-mer size                  0
K-score                     2147483647
Alphabet size               21
Max. sequence length        32000
Profile                     false
Max. results per query      100000
Offset result               0
Split DB                    0
Split mode                  2
Compositional bias          1
Diagonal Scoring            1
Minimum Diagonal score      15
Include identical Seq. Id.  false
Spaced Kmer                 1
Threads                     8
Verbosity                   3

Initialising data structures...
Using 8 threads.

Use index  uniref90.mms.sk7
Index version: 774909490
KmerSize:     7
AlphabetSize: 21
Skip:         0
Split:        1
Type:         1
Spaced:       1
Query database: subseq_Mms.1(size=32)
Target database: uniref90.mms(size=44448995)
Use kmer size 7 and split 1 using split mode 0
Needed memory (139010009596 byte) of total memory (67278442496 byte)
WARNING: MMseqs processes needs more main memory than available.Increase the size of --split or set it to 0 to automatic optimize target database split.
WARNING: Split has to be computed by createindex if precomputed index is used.
Substitution matrices...
Time for init: 0 h 0 m 12s

Process prefiltering step 0 of 1

tmp/pref_1_tmp_0.0: File exists
Program call:
subseq_Mms.1 uniref90.mms tmp/pref_1 tmp/aln_1 --sub-mat blosum62.out -a --alignment-mode 0 -e 1e-10 --min-seq-id 0 -c 0 --max-seq-len 32000 --max-seqs 100000 --comp-bias-corr 1 --max-rejected 2147483647 --threads 8 -v 3 

MMseqs Version:             c5615b34c686b1cf0f200670be8bc6cba76d90f9
Sub Matrix                  blosum62.out
Add backtrace               true
Alignment mode              0
E-value threshold           1e-10
Seq. Id Threshold           0
Coverage threshold          0
Max. sequence length        32000
Max. results per query      100000
Compositional bias          1
Profile                     false
Realign hit                 false
Max Reject                  2147483647
Detect fragments            false
Include identical Seq. Id.  false
Threads                     8
Verbosity                   3

Init data structures...
Compute score, coverage and sequence id.
Using 8 threads.
Could not open data file tmp/pref_1!
mv: cannot stat 'tmp/aln_1': No such file or directory

$mmseqs createindex uniref90.mms uniref90.mms.sk7 tmp --split 6 --threads 12
Program call:
uniref90.mms uniref90.mms.sk7 tmp --split 6 --threads 12 

MMseqs Version:         c5615b34c686b1cf0f200670be8bc6cba76d90f9
Sub Matrix              blosum62.out
K-mer size              0
Alphabet size           21
Max. sequence length    32000
Split DB                6
Spaced Kmer             1
Threads                 12
Verbosity               3

Substitution matrices...
Use kmer size 7 and split 6 using split mode
Index table: counting k-mers...
.WARNING: Sequence (dbKey=10870) contains only ATGC. It might be a nucleotide sequence.
..................................................................................................  1 Mio. sequences processed
................................................................................................... 2 Mio. sequences processed
................................................................................................... 3 Mio. sequences processed
................................................................................................... 4 Mio. sequences processed
................................................................................................... 5 Mio. sequences processed
.......................................................................WARNING: Sequence (dbKey=5712154) contains only ATGC. It might be a nucleotide sequence.
............................    6 Mio. sequences processed
........................
Index table: Masked residues: 40394498
Index table: fill...
................................................................................................... 1 Mio. sequences processed
................................................................................................... 2 Mio. sequences processed
................................................................................................... 3 Mio. sequences processed
................................................................................................... 4 Mio. sequences processed
................................................................................................... 5 Mio. sequences processed
................................................................................................... 6 Mio. sequences processed
........................
Index table: removing duplicate entries...
Index table init done.

Write 10
Write 20
Write 60
Could not write to data file /

Program call:
uniref90.mms uniref90.mms.sk7 tmp --threads 8 

MMseqs Version:         c5615b34c686b1cf0f200670be8bc6cba76d90f9
Sub Matrix              blosum62.out
K-mer size              0
Alphabet size           21
Max. sequence length    32000
Split DB                0
Spaced Kmer             1
Threads                 8
Verbosity               3

Substitution matrices...
Use kmer size 7 and split 4 using split mode
Index table: counting k-mers...
.WARNING: Sequence (dbKey=10870) contains only ATGC. It might be a nucleotide sequence.
..................................................................................................  1 Mio. sequences processed
................................................................................................... 2 Mio. sequences processed
................................................................................................... 3 Mio. sequences processed
................................................................................................... 4 Mio. sequences processed
................................................................................................... 5 Mio. sequences processed
.......................................................................WARNING: Sequence (dbKey=5712154) contains only ATGC. It might be a nucleotide sequence.
............................    6 Mio. sequences processed
......................................WARNING: Sequence (dbKey=6387662) contains only ATGC. It might be a nucleotide sequence.
.............................................................   7 Mio. sequences processed
................................................................................................... 8 Mio. sequences processed
................................................................................................... 9 Mio. sequences processed
..................................
Index table: Masked residues: 70834663
Index table: fill...
................................................................................................... 1 Mio. sequences processed
................................................................................................... 2 Mio. sequences processed
................................................................................................... 3 Mio. sequences processed
................................................................................................... 4 Mio. sequences processed
................................................................................................... 5 Mio. sequences processed
................................................................................................... 6 Mio. sequences processed
................................................................................................... 7 Mio. sequences processed
................................................................................................... 8 Mio. sequences processed
................................................................................................... 9 Mio. sequences processed
..................................
Index table: removing duplicate entries...
Index table init done.
Write 10
Write 20
Could not write to data file /

martin-steinegger commented 8 years ago

I'm happy to hear that MMseqs2 performs well in your benchmark. You can adjust the sensitivity/speed trade-off with the -s parameter.

MMseqs2 expects the computer that creates the index to have the same amount of memory as the computer that performs the search. If you don't want to precompute an index, you can call the search command without one. A non-persistent index is then created on the fly.
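
A memory mismatch of this kind shows up in the search log as the "Needed memory (... byte) of total memory (... byte)" line. As a minimal sketch, that line can be parsed to check whether a precomputed index fits on a given node. The regular expression is based only on the two log lines quoted in this thread, so the format may differ between versions, and the function name is hypothetical.

```python
import re

# Pattern taken from the log lines quoted in this thread (format assumed stable).
NEEDED_RE = re.compile(
    r"Needed memory \((\d+) byte\) of total memory \((\d+) byte\)"
)


def memory_check(log_text):
    """Return (needed, total, fits) from the first matching line, or None."""
    match = NEEDED_RE.search(log_text)
    if match is None:
        return None
    needed, total = int(match.group(1)), int(match.group(2))
    return needed, total, needed <= total


if __name__ == "__main__":
    # Line from the search on the 64GB node using the index built on the big server.
    line = "Needed memory (139010009596 byte) of total memory (67278442496 byte)"
    print(memory_check(line))
```

For the 64GB node the third element is False (about 139 GB needed against about 67 GB available), whereas the earlier run on the large server, with split 4 chosen automatically, fits comfortably.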

The command createindex should be $mmseqs createindex uniref90.mms --split 6 --threads 12 instead of $mmseqs createindex uniref90.mms uniref90.mms.sk7 tmp --split 6 --threads 12. Why did you call it this way? Is this somewhere in the documentation?

Zaphod-dev commented 8 years ago

I called createindex this way following the mmseqs internal documentation:

-bash-4.2$ mmseqs createindex
mmseqs createindex:
Precomputes an index table for the sequence DB. Handing over the precomputed index table as input to mmseqs search or mmseqs prefilter eliminates the computational overhead of building the index table on the fly.

Please cite:
Steinegger, M. & Soding, J. Sensitive protein sequence searching for the analysis of massive data sets. bioRxiv XXXX (2016)

© Martin Steinegger <martin.steinegger@mpibpc.mpg.de>

Usage: <i:sequenceDB> <o:indexDB> <tmpDir> [options]

prefilter options       default     description [value range]
  -k                    0           k-mer size in the range [6,7] (0: set automatically to optimum)
  --alph-size           21          alphabet size [2,21]                                        
  --split               0           splits target set in n equally distributed chunks. In default the split is automatically set
  --spaced-kmer-mode    1           0: use consecutive positions a k-mers; 1: use spaced k-mers 

clustlinear options     default     description [value range]
  -k                    0           k-mer size in the range [6,7] (0: set automatically to optimum)
  --alph-size           21          alphabet size [2,21]                                        

common options          default     description [value range]
  --sub-mat             blosum62.out    amino acid substitution matrix file                         
  --max-seq-len         32000       Maximum sequence length [1,32768]                           
  --threads             32          number of cores used for the computation (uses all cores by default)
  -v                    3           verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info

1 Database paths are required

When I omit the <o:indexDB> <tmpDir> parameters in the command line, the createindex command still fails:

-bash-4.2$ mmseqs createindex uniref90.mms --split 8 --threads 8
Program call:
uniref90.mms --split 8 --threads 8 

MMseqs Version:         c5615b34c686b1cf0f200670be8bc6cba76d90f9
Sub Matrix              blosum62.out
K-mer size              0
Alphabet size           21
Max. sequence length    32000
Split DB                8
Spaced Kmer             1
Threads                 8
Verbosity               3

Substitution matrices...
Use kmer size 7 and split 8 using split mode
Index table: counting k-mers...
.WARNING: Sequence (dbKey=10870) contains only ATGC. It might be a nucleotide sequence.
..................................................................................................  1 Mio. sequences processed
................................................................................................... 2 Mio. sequences processed
................................................................................................... 3 Mio. sequences processed
................................................................................................... 4 Mio. sequences processed
..................................................................
Index table: Masked residues: 22642771
Index table: fill...
................................................................................................... 1 Mio. sequences processed
................................................................................................... 2 Mio. sequences processed
................................................................................................... 3 Mio. sequences processed
................................................................................................... 4 Mio. sequences processed
..................................................................
Index table: removing duplicate entries...
Index table init done.

Write 10
Write 20
Write 60
Write 70
Write 80
Write 30
Write 40
Index table: counting k-mers...
................................................................................................... 1 Mio. sequences processed
....WARNING: Sequence (dbKey=5712154) contains only ATGC. It might be a nucleotide sequence.
...................................................................WARNING: Sequence (dbKey=6387662) contains only ATGC. It might be a nucleotide sequence.
............................    2 Mio. sequences processed
................................................................................................... 3 Mio. sequences processed
................................................................................................... 4 Mio. sequences processed
...................................................................
Index table: Masked residues: 48191892
Index table: fill...
................................................................................................... 1 Mio. sequences processed
................................................................................................... 2 Mio. sequences processed
................................................................................................... 3 Mio. sequences processed
................................................................................................... 4 Mio. sequences processed
...................................................................
Index table: removing duplicate entries...
Index table init done.

Write 11
Could not write to data file /

Zaphod-dev commented 8 years ago

I have an explanation for the above error message: even though there are TB of free space on the partition I'm using, I had reached my quota... After some spring cleaning, the above mmseqs createindex uniref90.mms --split 8 command completed successfully! Sorry for the unnecessary last report.
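For anyone hitting the same "Could not write to data file" message, here is a minimal sketch of a write probe that can be run against the output directory before a long createindex job. It is not part of MMseqs2; the function name `can_write` and the probe size are assumptions for illustration. It distinguishes an exceeded quota from a genuinely full disk, which free-space figures alone do not show.

```python
import errno
import os
import tempfile


def can_write(directory, n_bytes=1024 * 1024):
    """Try to write n_bytes to a temporary file in `directory`.

    Returns (True, None) on success, or (False, reason) if the write fails.
    The probe file is removed automatically when it is closed.
    """
    chunk = b"\0" * min(n_bytes, 1024 * 1024)
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            remaining = n_bytes
            while remaining > 0:
                piece = chunk[:remaining]
                probe.write(piece)
                remaining -= len(piece)
            # Force the data out so that quota errors surface here.
            probe.flush()
            os.fsync(probe.fileno())
    except OSError as exc:
        # EDQUOT is not defined on every platform, hence getattr.
        if exc.errno == getattr(errno, "EDQUOT", None):
            return False, "disk quota exceeded"
        if exc.errno == errno.ENOSPC:
            return False, "no space left on device"
        return False, (os.strerror(exc.errno) if exc.errno else str(exc))
    return True, None


if __name__ == "__main__":
    ok, reason = can_write(".")
    print("writable" if ok else "not writable: " + reason)
```

A small probe only catches a quota that is already exhausted; it cannot tell whether the whole index will fit.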

Zaphod-dev commented 8 years ago

A quick comment, in case it helps other users in a similar situation: it took me a while to understand why mmseqs2 search was at least an order of magnitude slower on our compute cluster than on a single server. It turns out that the UniRef90 target DB index file, built with enough splits (16) to fit the modest node RAM (32GB), was over 200GB in size and therefore too large for the limited node-local disks, so it stayed on the network shared disk bay (albeit with a reasonable 40Gb/s Infiniband connection to the nodes). In that configuration the mmseqs2 search jobs on the nodes were endlessly I/O bound, using about 20% of a single core instead of the 16 cores available. As soon as I deleted the target DB index files (sk7), the mmseqs2 search jobs distributed on the cluster nodes performed really well again!

So in a nutshell, I would recommend not using pre-indexed DB files on a compute cluster when these large files can't reside on a local disk, and using on-the-fly indexing instead. This of course adds a significant overhead, but one that is orders of magnitude smaller than the cost of accessing the index file on shared network storage. And of course the on-the-fly indexing overhead matters less as the query DB gets larger :) With large query DBs, I'm very impressed by the speed of (non-MPI) distributed mmseqs on our modest cluster (whilst providing excellent sensitivity of course)!
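The rule of thumb above can be sketched as a small check run on each node before a job starts. This is only an illustration under assumptions: the function names, the 10% reserve, and the strategy labels are hypothetical and not part of MMseqs2.

```python
import shutil


def index_fits_locally(index_bytes, local_dir, reserve_fraction=0.1):
    """Return True if an index of `index_bytes` fits in the free space of
    `local_dir` while keeping `reserve_fraction` of the disk capacity free."""
    usage = shutil.disk_usage(local_dir)
    reserve = usage.total * reserve_fraction
    return index_bytes <= usage.free - reserve


def choose_index_strategy(index_bytes, local_dir):
    """Pick between a local copy of the precomputed index and on-the-fly
    indexing (hypothetical labels, for illustration only)."""
    if index_fits_locally(index_bytes, local_dir):
        return "copy precomputed index to local disk"
    return "index on the fly"


if __name__ == "__main__":
    # A 200GB index, as in the UniRef90 case described above.
    print(choose_index_strategy(200 * 10**9, "."))
```

The check only looks at disk space; it says nothing about whether the node has enough RAM for the chosen split.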

martin-steinegger commented 8 years ago

Thank you very much for analyzing this behaviour. I never ran into this problem since our nodes have 128GB. The best strategy is to keep the index on a local SSD if possible. I will add this information to the user guide. However, I think most users will use it the way you did, since it's the most convenient solution. I will think about a way for MMseqs2 to automatically decide what the best strategy is.

Why do you split the database 8 times? MMseqs2 should automatically decide on the best number of splits if you don't specify the --split parameter.

If you are interested in learning more about MMseqs2, you can check out our paper on bioRxiv: http://www.biorxiv.org/content/early/2016/10/07/079681.

Zaphod-dev commented 8 years ago

Hi, I had used an explicit 'split' size because I intended to create the index on a different machine from the ones where the index would be used (which have less RAM). But because I'm now indexing on the fly, I no longer need to fiddle with the 'split' option :) Clearly a local SSD on each node would be ideal, but with index files as large as 200GB per database, this would require either deleting the index file after each job (and therefore transferring it again before each job, which means too much network traffic) or installing giant SSDs...
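To put a rough number on the per-job transfer cost, here is a back-of-the-envelope sketch. It assumes the link runs at its full nominal rate with no protocol, filesystem, or contention overhead, so the result is an optimistic lower bound; the function names are made up for illustration.

```python
def min_transfer_seconds(index_bytes, link_bits_per_second):
    """Lower bound on the time to copy the index over one link,
    assuming the full nominal line rate is available."""
    return index_bytes * 8 / link_bits_per_second


def total_traffic_bytes(index_bytes, n_jobs):
    """Bytes moved over the network if the index is re-copied before
    every one of `n_jobs` jobs."""
    return index_bytes * n_jobs


if __name__ == "__main__":
    index_bytes = 200 * 10**9  # 200GB index
    link = 40 * 10**9          # 40Gb/s Infiniband
    print(min_transfer_seconds(index_bytes, link), "seconds per copy, at best")
    print(total_traffic_bytes(index_bytes, 100) / 10**12, "TB for 100 jobs")
```

Real copies will be slower, since many nodes share the same storage and links rarely sustain their nominal rate.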

martin-steinegger commented 8 years ago

I consider this issue closed. Please open a new issue if you experience further problems. Thanks a lot for your feedback and for benchmarking MMseqs2.