soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Search with Error: Alignment died #134

Closed acpguedes closed 5 years ago

acpguedes commented 5 years ago

Expected Behavior

Search 13311 queries against database of 17M sequences

Current Behavior

Fails in alignment step:

Steps to Reproduce (for bugs)

mmseqs search tcdb_query.nr.db /databases/fadb/freeze/all.mmseqs tcdb_result.db tmp --threads 15 -s 7.5 --num-iterations 3 -a --max-seqs 17702628 -c 0.8 --add-self-matches

MMseqs Output (for bugs)

Program call: search tcdb_query.nr.db /databases/fadb/freeze/all.mmseqs tcdb_result.db tmp --threads 15 -s 7.5 --num-iterations 3 -a --max-seqs 17702628 -c 0.8 --add-self-matches

MMseqs Version: 7ca117893675cdca309e2c9dfc444bbc7462e858
Sub Matrix blosum62.out
Add backtrace true
Alignment mode 2
E-value threshold 0.001
Seq. Id Threshold 0
Seq. Id. Mode 0
Alternative alignments 0
Coverage threshold 0.8
Coverage Mode 0
Max. sequence length 65535
Max. results per query 17702628
Compositional bias 1
Realign hit false
Max Reject 2147483647
Max Accept 2147483647
Include identical Seq. Id. true
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Score bias 0
Gap open cost 11
Gap extension cost 1
Threads 15
Verbosity 3
Sensitivity 7.5
K-mer size 0
K-score 2147483647
Alphabet size 21
Offset result 0
Split DB 0
Split mode 2
Split Memory Limit 0
Diagonal Scoring 1
Exact k-mer matching 0
Mask Residues 1
Minimum Diagonal score 15
Spaced Kmer 1
Spaced k-mer pattern
Rescore mode 0
Remove hits by seq.id. and coverage false
Sort results 0
In substitution scoring mode, performs global alignment along the diagonal false
Mask profile 1
Profile e-value threshold 0.1
Use global sequence weighting false
Filter MSA 1
Maximum sequence identity threshold 0.9
Minimum seq. id. 0
Minimum score per column -20
Minimum coverage 0
Select n most diverse seqs 1000
Omit Consensus false
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward Frames 1,2,3
Reverse Frames 1,2,3
Translation Table 1
Use all table starts false
Offset of numeric ids 0
Add Orf Stop false
Number search iterations 3
Start sensitivity 4
Search steps 1
Run a seq-profile search in slice mode false
Strand selection 1
Disk space limit 0
Sets the MPI runner
Remove Temporary Files false

Program call: prefilter tcdb_query.nr.db /databases/fadb/freeze/all.mmseqs tmp/18071041534032520912/pref_0 --sub-mat blosum62.out -s 7.5 -k 0 --k-score 2147483647 --alph-size 21 --max-seq-len 65535 --max-seqs 17702628 --offset-result 0 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --min-ungapped-score 15 --add-self-matches 1 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 15 -v 3

MMseqs Version: 7ca117893675cdca309e2c9dfc444bbc7462e858
Sub Matrix blosum62.out
Sensitivity 7.5
K-mer size 0
K-score 2147483647
Alphabet size 21
Max. sequence length 65535
Max. results per query 17702628
Offset result 0
Split DB 0
Split mode 2
Split Memory Limit 0
Coverage threshold 0.8
Coverage Mode 0
Compositional bias 1
Diagonal Scoring 1
Exact k-mer matching 0
Mask Residues 1
Minimum Diagonal score 15
Include identical Seq. Id. true
Spaced Kmer 1
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Spaced k-mer pattern
Threads 15
Verbosity 3

Initialising data structures...
Using 15 threads.
Use index /databases/fadb/freeze/all.mmseqs.sk7
Index version: 12
Generated by: 7ca117893675cdca309e2c9dfc444bbc7462e858
MaxSeqLength: 65535
KmerSize: 7
CompBiasCorr: 1
AlphabetSize: 21
Masked: 1
Spaced: 1
KmerScore: 81
SequenceType: 0
Headers1: 1
Headers2: 0
ScoreMatrix: blosum62.out
Substitution matrices...
Substitution matrices...
Use kmer size 7 and split 1 using Target split mode.
Needed memory (68450582001 byte) of total memory (181308646195 byte)
Target database: /databases/fadb/freeze/all.mmseqs(Size: 17702628)
Query database type: Aminoacid
Target database type: Aminoacid
Time for init: 0h 3m 7s 669ms
Query database: tcdb_query.nr.db(size=13311)
Process prefiltering step 1 of 1

k-mer similarity threshold: 81
k-mer match probability: 0

Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 13311
Target db start 1 to 17702628
Sequence 421 produces too many hits. Results might be truncated
Sequence 470 produces too many hits. Results might be truncated
Sequence 468 produces too many hits. Results might be truncated
Sequence 1015 produces too many hits. Results might be truncated
Sequence 1865 produces too many hits. Results might be truncated
Sequence 2100 produces too many hits. Results might be truncated
Sequence 3033 produces too many hits. Results might be truncated
Sequence 3096 produces too many hits. Results might be truncated
Sequence 3465 produces too many hits. Results might be truncated
Sequence 4276 produces too many hits. Results might be truncated
Sequence 4262 produces too many hits. Results might be truncated
Sequence 4402 produces too many hits. Results might be truncated
Sequence 5038 produces too many hits. Results might be truncated
Sequence 5141 produces too many hits. Results might be truncated
Sequence 5661 produces too many hits. Results might be truncated
Sequence 6394 produces too many hits. Results might be truncated
Sequence 6621 produces too many hits. Results might be truncated
Sequence 6807 produces too many hits. Results might be truncated
Sequence 7051 produces too many hits. Results might be truncated
Sequence 7680 produces too many hits. Results might be truncated
Sequence 8382 produces too many hits. Results might be truncated
Sequence 9179 produces too many hits. Results might be truncated
.
Sequence 10717 produces too many hits. Results might be truncated
Sequence 11844 produces too many hits. Results might be truncated
Sequence 12630 produces too many hits. Results might be truncated
Sequence 12971 produces too many hits. Results might be truncated

38654 k-mers per position.
57944556 DB matches per sequence.
8775 Overflows.
1852622 sequences passed prefiltering per query sequence.
Median result list size: 1612533
0 sequences with 0 size result lists.

Time for prefiltering scores calculation: 0h 56m 42s 147ms
Time for merging files: 0h 17m 3s 113ms
Time for processing: 1h 16m 55s 652ms

Program call: align tcdb_query.nr.db /databases/fadb/freeze/all.mmseqs tmp/18071041534032520912/pref_0 tmp/18071041534032520912/aln_0 --sub-mat blosum62.out -a 1 --alignment-mode 2 -e 0.1 --min-seq-id 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --max-seqs 17702628 --comp-bias-corr 1 --realign 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 1 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open 11 --gap-extend 1 --threads 15 -v 3

MMseqs Version: 7ca117893675cdca309e2c9dfc444bbc7462e858
Sub Matrix blosum62.out
Add backtrace true
Alignment mode 2
E-value threshold 0.1
Seq. Id Threshold 0
Seq. Id. Mode 0
Alternative alignments 0
Coverage threshold 0.8
Coverage Mode 0
Max. sequence length 65535
Max. results per query 17702628
Compositional bias 1
Realign hit true
Max Reject 2147483647
Max Accept 2147483647
Include identical Seq. Id. true
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Score bias 0
Gap open cost 11
Gap extension cost 1
Threads 15
Verbosity 3

Init data structures...
Compute score only.
Use index /databases/fadb/freeze/all.mmseqs.sk7
Index version: 12
Generated by: 7ca117893675cdca309e2c9dfc444bbc7462e858
MaxSeqLength: 65535
KmerSize: 7
CompBiasCorr: 1
AlphabetSize: 21
Masked: 1
Spaced: 1
KmerScore: 81
SequenceType: 0
Headers1: 1
Headers2: 0
ScoreMatrix: blosum62.out
Touch data file tcdb_query.nr.db ... Done.
Query database type: Aminoacid
Target database type: Aminoacid
Calculation of Smith-Waterman alignments.
Error: Alignment died

Context

I have two databases. Their entries are not identical, but some sequences share 100% identity. I am trying to search with --add-self-matches so that I can cluster the result. The search fails in the alignment step when I use this option; without it, the search runs fine.

Your Environment

martin-steinegger commented 5 years ago

Could you please update to the latest version and recreate the index? There was a bug in this version. If the bug persists, could you please send us the debug backtrace? To debug, first compile MMseqs2 in debug mode:

 cmake -DCMAKE_BUILD_TYPE=Debug .. 
 make 

And then run the alignment with gdb

 gdb --args mmseqs align tcdb_query.nr.db /databases/fadb/freeze/all.mmseqs tmp/18071041534032520912/pref_0 tmp/18071041534032520912/aln_0 --sub-mat blosum62.out -a 1 --alignment-mode 2 -e 0.1 --min-seq-id 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --max-seqs 17702628 --comp-bias-corr 1 --realign 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 1 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open 11 --gap-extend 1 --threads 15 -v 3

After the crash just type

 bt

martin-steinegger commented 5 years ago

Is this problem resolved now?

acpguedes commented 5 years ago

Unfortunately not. Also, sorry for not getting back to you earlier.

Here is what I did:
1- Downloaded MMseqs2 version aa14ce37feb5eda7231af20259d8f2b659162236
2- Compiled it as described here, but replaced cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. .. with cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=. ..
3- Ran all the steps below:

mmseqs createdb tcdb_query.nr.fa tcdb_query.nr.db 2>&1 >> log
mmseqs createdb all.fa all.db 2>&1 >> log;
mmseqs search tcdb_query.nr.db all.db tcdb_result.db tmp --threads 40 -s 7.5 --num-iterations 3 -a --max-seqs 17702628 -c 0.8 --add-self-matches 2>&1 >> log

4- After the crash, ran:

gdb --args mmseqs align tcdb_query.nr.db all.db tmp/1072319213335698383/pref_0 tmp/1072319213335698383/aln_0 --sub-mat blosum62.out -a 1 --alignment-mode 2 -e 0.1 --min-seq-id 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --max-seqs 17702628 --comp-bias-corr 1 --realign 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 1 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open 11 --gap-extend 1 --threads 40 -v 3

Output:

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/acpguedes/source/MMseqs2/build/bin/mmseqs...done.
(gdb) bt
No stack.
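
The "No stack." reply is what gdb prints when bt is issued before the program has been started: gdb --args only loads the binary and its arguments, and the process still has to be launched with run before any backtrace exists. A minimal sketch of a non-interactive variant follows. The helper name backtrace_align is hypothetical; -batch, -ex and --args are standard gdb options. If the program terminates through a regular exit call rather than a signal, gdb will again report no stack, and a breakpoint would have to be set before run.

```shell
# Hypothetical helper: run `mmseqs align` under gdb and print a backtrace
# for every thread once the process stops (for example on a crash).
# Assumes gdb and a debug build of mmseqs are on PATH.
backtrace_align() {
    gdb -batch \
        -ex run \
        -ex 'thread apply all bt' \
        --args mmseqs align "$@"
}

# Usage (pass the arguments printed after "Program call: align"):
#   backtrace_align <query db> <target db> <prefilter result> <alignment result> [options]
```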

Important note: I sent STDOUT and STDERR to the log file, but one message was printed on the screen and not in the log file:
scoreIdentical has different length L: scoreIdentical has different length L: 604154 query_length: 126 query_length: 626
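
A likely reason this message reached only the screen: in 2>&1 >> log the redirections are applied left to right, so stderr is first pointed at wherever stdout currently goes (the terminal), and only stdout is then appended to log. Writing >> log 2>&1 sends both streams to the file. The small demonstration below is self-contained; emit is a hypothetical stand-in for any tool that writes to both streams.

```shell
# Hypothetical stand-in for a program that writes to both streams.
emit() {
    echo "message on stdout"
    echo "message on stderr" >&2
}

log_a=$(mktemp)
log_b=$(mktemp)

# Order used in the commands above: stderr is duplicated from the current
# stdout (the terminal) first, then stdout alone is appended to the file.
emit 2>&1 >> "$log_a"

# Reversed order: stdout is appended to the file first, then stderr is
# duplicated from it, so both streams end up in the file.
emit >> "$log_b" 2>&1

lines_a=$(grep -c '' "$log_a")   # 1: only the stdout line was logged
lines_b=$(grep -c '' "$log_b")   # 2: both lines were logged
echo "log_a: $lines_a line(s), log_b: $lines_b line(s)"
```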

I can send you the entire directory, but it is too large to upload to GitHub.

martin-steinegger commented 5 years ago

The --add-self-matches flag only works with databases that have the same size and the same entry identifiers. I assume this is what makes the example crash.
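
One way to check this precondition before searching is to compare the entry keys of the two databases. The sketch below assumes that each database comes with a tab-separated .index file whose first column is the entry key; that is the usual MMseqs2 layout, but it is worth verifying for the version in use. The helper name same_keys is hypothetical.

```shell
# Hypothetical helper: succeed only if two index files list exactly the
# same set of entry keys (first tab-separated column), in any order.
same_keys() {
    sk_a=$(mktemp)
    sk_b=$(mktemp)
    cut -f1 "$1" | sort > "$sk_a"
    cut -f1 "$2" | sort > "$sk_b"
    sk_rc=0
    cmp -s "$sk_a" "$sk_b" || sk_rc=$?
    rm -f "$sk_a" "$sk_b"
    return "$sk_rc"
}

# Usage (file names assume the databases created earlier in this thread):
#   same_keys tcdb_query.nr.db.index all.db.index \
#       && echo "same entries" || echo "entries differ"
```

With 13311 queries against 17702628 targets, this check would report differing entries, which matches the explanation above.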