soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

Error: Prefilter died #774

Open art-egorov opened 1 year ago

art-egorov commented 1 year ago

I'm running easy-search on a set of FASTA files. For most files everything is fine, but for a small subset I get the same error after prefiltering.

This is an example of my command: ../bin/mmseqs/bin/mmseqs easy-search s01_complete_refseq_representative_fasta_DEVIDED/mmseqs_rep_d_2.fa mmseqs/mmseqs_clu_rep_db/DB mmseqs_test.tsv tmp1 --format-mode 4 --num-iterations 5 -e 1e-5 --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits --max-seqs 1000000 -s 6

MMseqs Output (for bugs)

MMseqs Version:                         b22d5f6d02cb27ebc2cd931d8d20fe92ff54b8a8
Substitution matrix                     aa:blosum62.out,nucl:nucleotide.out                                                                                                                                                                                                    
Add backtrace                           false
Alignment mode                          3                                                                                                                                                                                                                                      
Alignment mode                          0
Allow wrapped scoring                   false                                                                                                                                                                                                                                  
E-value threshold                       1e-05
Seq. id. threshold                      0                                                                                                                                                                                                                                      
Min alignment length                    0
Seq. id. mode                           0                                                                                                                                                                                                                                      
Alternative alignments                  0
Coverage threshold                      0                                                                                                                                                                                                                                      
Coverage mode                           0
Max sequence length                     65535                                                                                                                                                                                                                                  
Compositional bias                      1
Compositional bias                      1                                                                                                                                                                                                                                      
Max reject                              2147483647
Max accept                              2147483647
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          substitution:1.100,context:1.400
Pseudo count b                          substitution:4.100,context:5.800                                                                                                                                                                                                       
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Correlation score weight                0
Gap open cost                           aa:11,nucl:5
Gap extension cost                      aa:1,nucl:2
Zdrop                                   40
Threads                                 16
Compressed                              0
Verbosity                               3
Seed substitution matrix                aa:VTML80.out,nucl:nucleotide.out
Sensitivity                             6
k-mer length                            0
Target search mode                      0
k-score                                 seq:2147483647,prof:2147483647
Alphabet size                           aa:21,nucl:5
Max results per query                   1000000
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask residues probability               0.9
Mask lower case residues                0
Minimum diagonal score                  15
Selected taxa
Spaced k-mers                           1
Spaced k-mer pattern
Local temporary path
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1                                                                                                                                                                                                               
Profile E-value threshold               0.001                                                                                                                                                                                                                                   
Global sequence weighting               false                                                                                                                                                                                                                                   
Allow deletions                         false                                                                                                                                                                                                                                   
Filter MSA                              1                                                                                                                                                                                                                                       
Use filter only at N seqs               0                                                                                                                                                                                                                                       
Maximum seq. id. threshold              0.9                                                                                                                                                                                                                                     
Minimum seq. id.                        0.0                                                                                                                                                                                                                                     
Minimum score per column                -20                                                                                                                                                                                                                                     
Minimum coverage                        0                                                                                                                                                                                                                                       
Select N most diverse seqs              1000                                                                                                                                                                                                                                    
Pseudo count mode                       0                                                                                                                                                                                                                                       
Min codons in orf                       30                                                                                                                                                                                                                                      
Max codons in length                    32734                                                                                                                                                                                                                                   
Max orf gaps                            2147483647                                                                                                                                                                                                                              
Contig start mode                       2                                                                                                                                                                                                                                       
Contig end mode                         2                                                                                                                                                                                                                                       
Orf start mode                          1                                                                                                                                                                                                                                       
Forward frames                          1,2,3                                                                                                                                                                                                                                   
Reverse frames                          1,2,3                                                                                                                                                                                                                                   
Translation table                       1                                                                                                                                                                                                                                       
Translate orf                           0                                                                                                                                                                                                                                       
Use all table starts                    false                                                                                                                                                                                                                                   
Offset of numeric ids                   0                                                                                                                                                                                                                                       
Create lookup                           0                                                                                                                                                                                                                                       
Add orf stop                            false                                                                                                                                                                                                                                   
Overlap between sequences               0                                                                                                                                                                                                                                       
Sequence split mode                     1                                                                                                                                                                                                                                       
Header split mode                       0                                                                                                                                                                                                                                       
Chain overlapping alignments            0                                                                                                                                                                                                                                       
Merge query                             1                                                                                                                                                                                                                                       
Search type                             0                                                                                                                                                                                                                                       
Search iterations                       5                                                                                                                                                                                                                                       
Start sensitivity                       4                                                                                                                                                                                                                                       
Search steps                            1                                                                                                                                                                                                                                       
Prefilter mode                          0    
Exhaustive search mode                  false                                                                                                                                                                                                                                   
Filter results during exhaustive search 0                                                                                                                                                                                                                                       
Strand selection                        1                                                                                                                                                                                                                                       
LCA search mode                         false                                                                                                                                                                                                                                   
Disk space limit                        0                                                                                                                                                                                                                                       
MPI runner                                                                                                                                                                                                                                                                     
Force restart with latest tmp           false                                                                                                                                                                                                                                  
Remove temporary files                  true                                                                                                                                                                                                                                   
Alignment format                        4                                                                                                                                                                                                                                      
Format alignment output                 query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits                                                                                                                                                        
Database output                         false                                                                                                                                                                                                                                  
Overlap threshold                       0                                                                                                                                                                                                                                      
Database type                           0                                                                                                                                                                                                                                      
Shuffle input database                  true                                                                                                                                                                                                                                   
Createdb mode                           0                                                                                                                                                                                                                                      
Write lookup file                       0                                                                                                                                                                                                                                      
Greedy best hits                        false                                                                                                                                                                                                                                  

createdb s01_complete_refseq_representative_fasta_DEVIDED/mmseqs_rep_d_2.fa tmp1/1465312676443513838/query --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3                                                                         

Converting sequences                                                                                                                                                                                                                                                           
[910] 0s 184ms                                                                                                                                                                                                                                                                 
Time for merging to query_h: 0h 0m 0s 44ms                                                                                                                                                                                                                                     
Time for merging to query: 0h 0m 0s 35ms                                                                                                                                                                                                                                       
Database type: Aminoacid                                                                                                                                                                                                                                                       
Time for processing: 0h 0m 0s 296ms                                                                                                                                                                                                                                            
Create directory tmp1/1465312676443513838/search_tmp                                                                                                                                                                                                                           
search tmp1/1465312676443513838/query mmseqs/mmseqs_clu_rep_db/DB tmp1/1465312676443513838/result tmp1/1465312676443513838/search_tmp --alignment-mode 3 -e 1e-05 -s 6 --max-seqs 1000000 --num-iterations 5 --remove-tmp-files 1                                              

prefilter tmp1/1465312676443513838/query mmseqs/mmseqs_clu_rep_db/DB.idx tmp1/1465312676443513838/search_tmp/12840997425876760019/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 6 -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1000000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 16 --compressed 0 -v 3

Index version: 16                                                                                                                                                                                                                                                              
Generated by:  bb0a1b3569b9fe115f3bf63e5ba1da234748de23
ScoreMatrix:  VTML80.out                                                                                                                                                                                                                                                       
Query database size: 1000 type: Aminoacid
Estimated memory consumption: 101G
Target database size: 33611651 type: Aminoacid                                                                                                                                                                                                                                 
Process prefiltering step 1 of 1

k-mer similarity threshold: 118                                                                                                                                                                                                                                                
Starting prefiltering scores calculation (step 1 of 1)                                                                                                                                                                                                                         
Query db start 1 to 1000                                                                                                                                                                                                                                                       
Target db start 1 to 33611651                                                                                                                                                                                                                                                  
[=================================================================] 100.00% 1.00K 53s 841ms                                                                                                                                                                                    
tmp1/1465312676443513838/search_tmp/12840997425876760019/blastpgp.sh: line 139: 3819000 Segmentation fault      (core dumped) $RUNNER $PREF "$QUERYDB" "$2" "$TMP_PATH/pref_$STEP" ${TMP}                                                                                      
Error: Prefilter died                                                                                                                                                                                                                                                          
Error: Search died  

Context

I thought it might be caused by special symbols in the sequences of the failed FASTA files, or by larger proteins. That does not seem to be the case: "X" symbols were present in the completed FASTA files as well, as were proteins of length ~30K and short ones. Dividing these FASTA files into a set of smaller ones solves the problem for a subset of the new files, but still leaves some with the same error. I can send an example FASTA if needed.
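The splitting described above can be taken to its limit, one sequence per file, so that each query can be searched on its own and the crashing records identified directly. The following is only a minimal sketch: the function names and the `query_00000.fa` naming scheme are made up for illustration, and it assumes a plain, uncompressed FASTA file.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) tuples from a plain-text FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(path, out_dir):
    """Write every record to its own file and return the created paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, (header, sequence) in enumerate(read_fasta(path)):
        # Hypothetical naming scheme: query_00000.fa, query_00001.fa, ...
        target = out_dir / f"query_{index:05d}.fa"
        target.write_text(f">{header}\n{sequence}\n")
        written.append(target)
    return written
```

Each resulting file can then be passed as the query to the same easy-search command; the files for which the prefilter still dies are the ones worth attaching to the report.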

art-egorov commented 11 months ago

UPD: it seems all failed proteins are forms of titin (e.g. XP_035030256.2, XP_035030222.2, XP_045336327.1).
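Since titin is an unusually long protein, it may help to tabulate sequence lengths and compare the failing queries with the ones that completed. The sketch below is an illustration only: the function names are hypothetical, and it assumes the accession is the first whitespace-delimited token of each FASTA header. It does not by itself show that length is the cause of the crash.

```python
def fasta_lengths(text):
    """Map each accession (first header token) to its sequence length."""
    lengths = {}
    current = None
    for line in text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            lengths[current] = 0
        elif current is not None:
            # Sequence lines may be wrapped, so lengths are summed per record.
            lengths[current] += len(line.strip())
    return lengths


def longest(lengths, top=10):
    """Return the `top` longest (accession, length) pairs, longest first."""
    return sorted(lengths.items(), key=lambda item: item[1], reverse=True)[:top]
```

Running `longest(fasta_lengths(open(path).read()))` on a failing and on a completed FASTA file gives two short lists that can be compared side by side.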

milot-mirdita commented 11 months ago

Does the crash also happen with a smaller --max-seqs (currently it's set to --max-seqs 1000000)?

Are the failed proteins on the query side? Do these queries also crash against a small DB (e.g. the DB.fasta in the examples folder)?

art-egorov commented 9 months ago

Yep, it also crashes without the --max-seqs parameter, and a search with these proteins does not crash against DB.fasta.