soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

mmseqs search report error and Alignment died when --num-iterations >= 2 #747

Open hwy7 opened 1 year ago

hwy7 commented 1 year ago

Expected Behavior

The search should create a result database with and without --num-iterations. This command succeeds:
mmseqs search query/queryDB target/targetDB search/resultDB -s 7.5 --search-type 3
but this one fails:
mmseqs search query/queryDB target/targetDB search/resultDB -s 7.5 --search-type 3 --num-iterations 2

Current Behavior

Error: Alignment died
Error: Search step died

Steps to Reproduce (for bugs)

mmseqs createdb query.fasta query/queryDB
mmseqs createdb target.fasta target/targetDB
mmseqs search query/queryDB target/targetDB search/resultDB tmp -s 7.5 --search-type 3 --num-iterations 2

MMseqs Output (for bugs)

MMseqs Version: df77d9e6cf640fe8990f247441ab44d4f4ad9cdc
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Add backtrace true
Alignment mode 3
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0
Coverage mode 2
Max sequence length 10000
Compositional bias 1
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Threads 96
Compressed 0
Verbosity 3
Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out
Sensitivity 7.5
k-mer length 15
Target search mode 0
k-score seq:2147483647,prof:2147483647
Alphabet size aa:21,nucl:5
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 1
Mask residues 1
Mask residues probability 0.9
Mask lower case residues 0
Minimum diagonal score 15
Selected taxa
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.1
Global sequence weighting false
Allow deletions false
Filter MSA 1
Use filter only at N seqs 0
Maximum seq. id. threshold 0.9
Minimum seq. id. 0.0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Pseudo count mode 0
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 3
Search iterations 2
Start sensitivity 4
Search steps 1
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 2
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files false

splitsequence sub/subDB tmp/7935334228278574252/target_seqs_split --max-seq-len 10000 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --create-lookup 0 --threads 96 --compressed 0 -v 3

[=================================================================] 100.00% 365.60K 1s 853ms
Time for merging to target_seqs_split_h: 0h 0m 0s 83ms
Time for merging to target_seqs_split: 0h 0m 0s 97ms
Time for processing: 0h 0m 2s 329ms

extractframes querydata/queryDB tmp/7935334228278574252/query_seqs --forward-frames 1 --reverse-frames 1 --create-lookup 0 --threads 96 --compressed 0 -v 3

[=================================================================] 100.00% 2.00K 0s 18ms
Time for merging to query_seqs_h: 0h 0m 0s 62ms
Time for merging to query_seqs: 0h 0m 0s 6ms
Time for processing: 0h 0m 0s 213ms

splitsequence tmp/7935334228278574252/query_seqs tmp/7935334228278574252/query_seqs_split --max-seq-len 10000 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --create-lookup 0 --threads 96 --compressed 0 -v 3

Time for processing: 0h 0m 0s 0ms

prefilter tmp/7935334228278574252/query_seqs_split tmp/7935334228278574252/target_seqs_split tmp/7935334228278574252/search/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 7.5 -k 15 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 10000 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 2 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 1 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3

Query database size: 4000 type: Nucleotide
Estimated memory consumption: 11G
Target database size: 365688 type: Nucleotide
Index table k-mer threshold: 0 at k-mer size 15
Index table: counting k-mers
[=================================================================] 100.00% 365.69K 16s 177ms
Index table: Masked residues: 1079896
Index table: fill
[=================================================================] 100.00% 365.69K 12s 498ms
Index statistics
Entries: 297952985
DB size: 9896 MB
Avg k-mer size: 0.277490
Top 10 k-mers
GGCGCAGCGCGGTGC 366
TCCGGGCCGCACGGT 330
GTCGCGGCAGCGCCG 209
CAGACGCGCGTGCCG 204
CGCGCGCGTCGCGCG 167
CGCGCGCGTGGCGCG 157
GCTGCGCGCGGCGCG 151
CGCGGGCGTGGCGCG 149
CGTGCGCGTGGCGCG 147
CGCGCGCCCGGCGCG 133
Time for index table init: 0h 0m 39s 203ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 0
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 4000
Target db start 1 to 365688
[=================================================================] 100.00% 4.00K 0s 74ms
[================================================================>] 99.72% 3.99K eta 0s
0.926667 k-mers per position
434 DB matches per sequence
0 overflows
4 sequences passed prefiltering per query sequence
1 median result list length
1762 sequences with 0 size result lists
Time for merging to pref_0: 0h 0m 0s 5ms
Time for processing: 0h 0m 40s 369ms

align tmp/7935334228278574252/query_seqs_split tmp/7935334228278574252/target_seqs_split tmp/7935334228278574252/search/pref_0 tmp/7935334228278574252/search/aln_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 2 --max-seq-len 10000 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 1 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 96 --compressed 0 -v 3

Compute score only
Query database size: 4000 type: Nucleotide
Target database size: 365688 type: Nucleotide
Calculation of alignments
Query sequence 236 has a result with no diagonal information. Please check your database.
Error: Alignment died
Error: Search step died

Your Environment

Include as many relevant details about the environment you experienced the bug in.

milot-mirdita commented 1 year ago

First: neither the sensitivity parameter (-s) nor the iteration parameter (--num-iterations) has any effect on nucleotide searches in MMseqs2. Sensitivity adjusts the length of the similar k-mer lists, which are not generated for nucleotides (all substitutions have the same score, so you can't generate similar k-mers).

Profile searches are also not implemented for nucleotides.

However, the error is still very surprising and should not happen. Could you share the sequences with us?
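As a rough sketch of what the points above imply for the reproduction steps: if -s and --num-iterations have no effect on nucleotide searches, the command can be reduced to the form below. The database paths and the tmp directory are taken from the report and are assumptions about the local layout. Going by the report, the search completed without --num-iterations, so this is a way to get results, not a fix for the crash itself.

```shell
# Paths copied from the report; adjust them to your own layout (assumption).
QUERY_DB=query/queryDB
TARGET_DB=target/targetDB
RESULT_DB=search/resultDB
TMP_DIR=tmp

# Same nucleotide search (--search-type 3) without -s and --num-iterations.
CMD="mmseqs search $QUERY_DB $TARGET_DB $RESULT_DB $TMP_DIR --search-type 3"

# Print the command so it can be inspected before anything runs.
echo "$CMD"

# Run it only when mmseqs is installed and the query database exists.
# $CMD is left unquoted on purpose so it splits into separate arguments;
# this assumes none of the paths contain spaces.
if command -v mmseqs >/dev/null 2>&1 && [ -e "$QUERY_DB" ]; then
    $CMD
fi
```

The script only prints the command when mmseqs or the query database is missing, so it is safe to dry-run on a machine without the data.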

hwy7 commented 1 year ago

Thank you for your reply. My target sequences are CDS sequences downloaded from NCBI, and my query sequences are 300 bp sequence fragments. Partial target and query sequences are available here: https://gist.github.com/hwy7/cd5486d2a61c3b6bfe990a3ada669318 Please let me know if you need any more information or if there are specific analyses you would like to perform with this data. Thanks.