soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License
1.44k stars 197 forks source link

easy-search blastx fail #754

Open mewu3 opened 1 year ago

mewu3 commented 1 year ago

Expected Behavior

nuc reads would align to aa DB

Current Behavior

Input tmp1/15265602719800494645/search_tmp/18256837363344165166/search/pref does not exist

Steps to Reproduce (for bugs)

B5177-2-N919D_T1_interleaved.fq.gz DB_TARGET_PATH_NAME is the pfam DB downloaded with mmseqs

mmseqs \
    easy-search \
    <(zcat B5177-2-N919D_T1_interleaved.fq.gz) \
    $DB_TARGET_PATH_NAME \
    B5177-2-N919D_T1.tsv \
    tmp1 \
     \
    --threads 12 \
    --compressed

MMseqs Output (for bugs)

No output

Context

i try another fasta file, things went well. But with the uploaded file as input, i get the error said Input tmp1/15265602719800494645/search_tmp/18256837363344165166/search/pref does not exist, could not figure why.

Your Environment

mmseqs version: 13.45111 conda installed Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-153-generic x86_64)

milot-mirdita commented 1 year ago

I am not sure if <() will work. Please try zcat file | mmseqs easy-search stdin … instead.

milot-mirdita commented 1 year ago

Also mmseqs can read gzip files directly, you shouldn’t need either piping or <()

mewu3 commented 1 year ago

zcat the gzipped fq file before execute mmseqs easy-search change nothing, the error remains the same

easy-search B5177-2-N919D_T1_interleaved.fq pfama_20230914/pfam-a-full B5177-2-N919D_T1.tsv tmp1 --threads 12 --compressed 1 

MMseqs Version:                         13.45111
Substitution matrix                     nucl:nucleotide.out,aa:blosum62.out
Add backtrace                           false
Alignment mode                          3
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       0.001
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0
Coverage mode                           0
Max sequence length                     65535
Compositional bias                      1
Max reject                              2147483647
Max accept                              2147483647
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          1
Pseudo count b                          1.5
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Gap open cost                           nucl:5,aa:11
Gap extension cost                      nucl:2,aa:1
Zdrop                                   40
Threads                                 12
Compressed                              1
Verbosity                               3
Seed substitution matrix                nucl:nucleotide.out,aa:VTML80.out
Sensitivity                             5.7
k-mer length                            0
k-score                                 2147483647
Alphabet size                           nucl:5,aa:21
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask lower case residues                0
Minimum diagonal score                  15
Spaced k-mers                           1
Spaced k-mer pattern                   
Local temporary path                   
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile E-value threshold               0.001
Global sequence weighting               false
Allow deletions                         false
Filter MSA                              1
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1,2,3
Reverse frames                          1,2,3
Translation table                       1
Translate orf                           0
Use all table starts                    false
Offset of numeric ids                   0
Create lookup                           0
Add orf stop                            false
Overlap between sequences               0
Sequence split mode                     1
Header split mode                       0
Chain overlapping alignments            0
Merge query                             1
Search type                             0
Search iterations                       1
Start sensitivity                       4
Search steps                            1
Exhaustive search mode                  false
Filter results during exhaustive search 0
Strand selection                        1
LCA search mode                         false
Disk space limit                        0
MPI runner                             
Force restart with latest tmp           false
Remove temporary files                  true
Alignment format                        0
Format alignment output                 query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output                         false
Overlap threshold                       0
Database type                           0
Shuffle input database                  true
Createdb mode                           0
Write lookup file                       0
Greedy best hits                        false

createdb B5177-2-N919D_T1_interleaved.fq tmp1/15089590555632328148/query --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 1 -v 3 

Converting sequences
[10000] 0s 75ms
Time for merging to query_h: 0h 0m 0s 1ms
Time for merging to query: 0h 0m 0s 1ms
Database type: Nucleotide
Time for processing: 0h 0m 0s 87ms
Create directory tmp1/15089590555632328148/search_tmp
search tmp1/15089590555632328148/query pfama_20230914/pfam-a-full tmp1/15089590555632328148/result tmp1/15089590555632328148/search_tmp --alignment-mode 3 --threads 12 --compressed 1 -s 5.7 --remove-tmp-files 1 

extractorfs tmp1/15089590555632328148/query tmp1/15089590555632328148/search_tmp/16020913209305565279/q_orfs_aa --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 1 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --threads 12 --compressed 1 -v 3 

[=================================================================] 100.00% 10.00K 0s 19ms    
Time for merging to q_orfs_aa_h: 0h 0m 0s 0ms
Time for merging to q_orfs_aa: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 27ms
prefilter tmp1/15089590555632328148/search_tmp/16020913209305565279/q_orfs_aa pfama_20230914/pfam-a-full tmp1/15089590555632328148/search_tmp/16020913209305565279/search/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 5.7 -k 5 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 12 --compressed 1 -v 3 

Query database size: 0 type: Aminoacid
Estimated memory consumption: 557M
Target database size: 19482 type: Profile
Index table k-mer threshold: 90 at k-mer size 5 
Index table: counting k-mers
[=================================================================] 100.00% 19.48K 4s 492ms    
Index table: Masked residues: 0
Index table: fill
[=================================================================] 100.00% 19.48K 11s 88ms    
Index statistics
Entries:          454766207
DB size:          2633 MB
Avg k-mer size:   111.350382
Top 10 k-mers
    RRRRR       1414
    QQQQQ       1135
    RRRRQ       985
    SPPPP       965
    QRRRR       962
    PPPPS       953
    RQRRR       940
    RRRQR       927
    RHRRR       914
    RKRRR       902
Time for index table init: 0h 0m 16s 838ms
Time for processing: 0h 0m 17s 228ms
swapresults tmp1/15089590555632328148/search_tmp/16020913209305565279/q_orfs_aa pfama_20230914/pfam-a-full tmp1/15089590555632328148/search_tmp/16020913209305565279/search/pref tmp1/15089590555632328148/search_tmp/16020913209305565279/search/pref_swapped --sub-mat nucl:nucleotide.out,aa:blosum62.out -e 0.001 --split-memory-limit 0 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --threads 12 --compressed 1 --db-load-mode 0 -v 3 

Input tmp1/15089590555632328148/search_tmp/16020913209305565279/search/pref does not exist
Error: Swapresults pref died
Error: Search step died
Error: Search died
milot-mirdita commented 1 year ago

What's the length of the reads? We demand a minimum length for extracted ORFs of 30 AA, you can adjust this with the --min-length. Currently its extracting 0 ORFs and then crashing afterwards, since there are no query sequences.

mewu3 commented 1 year ago

these are 50bp nuc sequence, and yes you are right, setting the --min-length to 10 i get the results, thanks a lot

mewu3 commented 1 year ago

and i got another question, i'm confused with the search-type option of easy-search, i don't know the correspondence with the number. please tell if the following guessing is correct. 0: auto 1: amino acid, BLASTP 2: translated, BLASTX 3: nucleotide, -> BLASTN 4: translated nucleotide alignment -> TBLASTX

milot-mirdita commented 1 year ago

This is a translated search against a protein (profile) db, so this corresponds to blastx/2. (kind of I don’t think there is a actually translated profile search mode in blast). This is closer to something like a translated hmmer search.