soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

MMseqs2 dies after write error #478

Open torstenthomas opened 3 years ago

torstenthomas commented 3 years ago

Hello.

MMseqs2 dies after a write error. The shell output is below. Does anyone know why this happens?

Thanks,

Torsten


Create directory tmp
search kelp_database uniref50 results tmp

MMseqs Version: GITDIR-NOTFOUND
Substitution matrix nucl:nucleotide.out,aa:blosum62.out
Add backtrace false
Alignment mode 2
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0
Coverage mode 0
Max sequence length 65535
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Gap open cost nucl:5,aa:11
Gap extension cost nucl:2,aa:1
Zdrop 40
Threads 40
Compressed 0
Verbosity 3
Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out
Sensitivity 5.7
k-mer length 0
k-score 2147483647
Alphabet size nucl:5,aa:21
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask lower case residues 0
Minimum diagonal score 15
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.1
Global sequence weighting false
Allow deletions false
Filter MSA 1
Maximum seq. id. threshold 0.9
Minimum seq. id. 0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 0
Search iterations 1
Start sensitivity 4
Search steps 1
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files false

extractorfs kelp_database tmp/3499313520568641582/q_orfs_aa --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 1 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --threads 40 --compressed 0 -v 3

[=================================================================] 714.62K 18s 1ms
Time for merging to q_orfs_aa_h: 0h 0m 17s 230ms
Time for merging to q_orfs_aa: 0h 0m 22s 109ms
Time for processing: 0h 1m 20s 933ms

prefilter tmp/3499313520568641582/q_orfs_aa uniref50 tmp/3499313520568641582/search/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 40 --compressed 0 -v 3 -s 5.7

Query database size: 61153309 type: Aminoacid
Target split mode. Searching through 2 splits
Estimated memory consumption: 99G
Target database size: 48531432 type: Aminoacid
Process prefiltering step 1 of 2

Index table k-mer threshold: 122 at k-mer size 7
Index table: counting k-mers
[=================================================================] 24.26M 1m 44s 733ms
Index table: Masked residues: 187925951
Index table: fill
[=================================================================] 24.26M 3m 17s 822ms
Index statistics
Entries: 6286866786
DB size: 45739 MB
Avg k-mer size: 4.911615
Top 10 k-mers
DFEQLPH 32892
NVPGWSP 32831
FRYAFPS 32736
RYYVLGW 32688
WRLDFLN 31763
TVDGDFS 31579
NKTDFVQ 31135
QDWVQIP 30874
LDGAYVP 30051
ETGRYNV 29832
Time for index table init: 0h 5m 17s 428ms
k-mer similarity threshold: 122
Starting prefiltering scores calculation (step 1 of 2)
Query db start 1 to 61153309
Target db start 1 to 24258060
[=================================================================] 61.15M 27h 30m 21s 285ms

2412.140792 k-mers per position
608927 DB matches per sequence
175 overflows
0 queries produce too many hits (truncated result)
196 sequences passed prefiltering per query sequence
198 median result list length
34327 sequences with 0 size result lists
Time for merging to pref_0_tmp_0: 0h 0m 44s 467ms
Time for merging to pref_0_tmp_0_tmp: 0h 12m 8s 854ms
Process prefiltering step 2 of 2

Index table k-mer threshold: 122 at k-mer size 7
Index table: counting k-mers
[=================================================================] 24.27M 1m 54s 630ms
Index table: Masked residues: 187586445
Index table: fill
[=================================================================] 24.27M 3m 32s 124ms
Index statistics
Entries: 6287362445
DB size: 45742 MB
Avg k-mer size: 4.912002
Top 10 k-mers
DFEQLPH 33023
NVPGWSP 32989
QGKSPFQ 32900
FRYAFPS 32880
RYYVLGW 32788
WRLDFLN 31914
TVDGDFS 31713
NKTDFVQ 31393
QDWVQIP 31110
LDGAYVP 30048
Time for index table init: 0h 5m 43s 381ms
k-mer similarity threshold: 122
Starting prefiltering scores calculation (step 2 of 2)
Query db start 1 to 61153309
Target db start 24258061 to 48531432
[=================================================================] 61.15M 28h 53m 9s 351ms

2412.140792 k-mers per position
608911 DB matches per sequence
174 overflows
0 queries produce too many hits (truncated result)
196 sequences passed prefiltering per query sequence
198 median result list length
34543 sequences with 0 size result lists
Time for merging to pref_0_tmp_1: 0h 0m 35s 830ms
Time for merging to pref_0_tmp_1_tmp: 0h 10m 54s 136ms
Merging 2 target splits to pref_0
Preparing offsets for merging: 0h 0m 19s 175ms
[=================================================================] 61.15M 26m 22s 19ms
Time for merging to pref_0: 0h 0m 36s 177ms
Time for merging target splits: 0h 27m 38s 35ms
write error
Error: Prefilter died
Error: Search step died
-bash-4.2$

milot-mirdita commented 3 years ago

Is tmp on a network share? Could you try placing it on a local disk? Is there enough free space?
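The free-space question can be checked before starting a long run. The snippet below is a minimal sketch using Python's standard `shutil.disk_usage`; the helper names `free_space_gib` and `has_enough_space` and the 500 GiB figure are illustrative assumptions, not part of MMseqs2, and it does not detect whether the path is a network share.

```python
import shutil


def free_space_gib(path):
    """Return the free space, in GiB, of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


def has_enough_space(path, required_gib):
    """True if the filesystem holding `path` has at least `required_gib` GiB free."""
    return free_space_gib(path) >= required_gib


# "." stands in for the tmp directory passed to the search; replace as needed.
tmp_dir = "."
required_gib = 500  # hypothetical requirement, chosen only for illustration

print(f"Free space under {tmp_dir!r}: {free_space_gib(tmp_dir):.1f} GiB")
print(f"At least {required_gib} GiB free: {has_enough_space(tmp_dir, required_gib)}")
```

Because the search writes intermediate results while it runs, a check like this only says something about the starting point, not about peak use.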

torstenthomas commented 3 years ago

Thanks. tmp is on a local disk, but I only have 1 TB available. The disk fills up during execution, although I only want to search about 700K sequences against UniRef50. I will try --remove-tmp-files.
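A rough back-of-the-envelope calculation, using only figures from the shell output above, suggests why a run over about 700K nucleotide sequences can need this much disk. The sketch below is not how MMseqs2 accounts for space. The bytes-per-hit value is an assumption made purely for illustration, and it also assumes the "sequences passed prefiltering per query sequence" statistic is reported per target split.

```python
# Figures copied from the shell output above.
input_sequences = 714_620        # "714.62K" in the extractorfs progress bar (rounded)
orf_queries = 61_153_309         # "Query database size" after ORF extraction
hits_per_query_per_split = 196   # "sequences passed prefiltering per query sequence"
target_splits = 2                # "Searching through 2 splits"

# ASSUMPTION: average on-disk size of one prefilter hit. Not taken from the
# log or from MMseqs2 documentation; picked only to get an order of magnitude.
assumed_bytes_per_hit = 20


def estimate_prefilter_bytes(queries, hits_per_query, splits, bytes_per_hit):
    """Estimate the raw size of all prefilter hits before any merging."""
    return queries * hits_per_query * splits * bytes_per_hit


orfs_per_sequence = orf_queries / input_sequences
total_bytes = estimate_prefilter_bytes(
    orf_queries, hits_per_query_per_split, target_splits, assumed_bytes_per_hit
)

print(f"ORF queries per input sequence: {orfs_per_sequence:.1f}")
print(f"Estimated raw prefilter results: {total_bytes / 1e9:.0f} GB")
```

Under these assumptions the raw prefilter hits alone come to several hundred gigabytes. The log also shows the per-split results being merged into new files (pref_0_tmp_0, pref_0_tmp_0_tmp, pref_0), so more than one copy may exist on disk at the same time.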

milot-mirdita commented 3 years ago

You can slightly increase the minimum extracted ORF fragment length with --min-length (the default is 30 amino acids / 90 nucleotides).

--remove-tmp-files only controls whether the contents of the tmp folder are kept after the search finishes; it will not reduce peak disk use.
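As a sketch of the suggestion above, the original invocation from the log (`search kelp_database uniref50 results tmp`) can be rebuilt with a larger `--min-length`. The helper `build_search_command` and the value 40 are hypothetical and chosen only for illustration; the thread does not state which value was eventually used.

```python
import shlex


def build_search_command(query_db, target_db, result_db, tmp_dir, min_length):
    """Build an `mmseqs search` argument list with a custom minimum ORF length."""
    return [
        "mmseqs", "search",
        query_db, target_db, result_db, tmp_dir,
        "--min-length", str(min_length),  # default is 30 according to the log
    ]


# 40 is a hypothetical value, slightly above the default of 30.
cmd = build_search_command("kelp_database", "uniref50", "results", "tmp", 40)

# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(cmd))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; it could be passed to `subprocess.run(cmd, check=True)` on a machine where `mmseqs` is installed.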

milot-mirdita commented 3 years ago

I was meaning to add the same additional prefiltering stage we used in the recent MMseqs2 taxonomy paper to the normal search. This would also speed up the search and reduce disk use at a slight sensitivity penalty, but I haven't gotten around to that.

torstenthomas commented 3 years ago

Thanks. This worked.