steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

Input taxonomy database is missing when using createdb #276

Open kchennen opened 6 months ago

kchennen commented 6 months ago

Hello, I am having trouble running foldseek easy-search against my custom database, which contains only yeast AF2 PDB files. I think the problem is that I am requesting the "taxid,taxname" columns in the output format. How can I create the missing taxonomy files for my custom target database? Could you please help me?

Current Behavior

Steps to Reproduce (for bugs)

# Create target database with only S. cerevisiae AF2 pdb files
foldseek createdb ${PROJECT_DIR}/yeast_proteome_af2/ ${FOLDSEEK_YEAST_DIR}/yeastDB --write-mapping 1

# Generate the index and store it on disk
foldseek createindex ${FOLDSEEK_YEAST_DIR}/yeastDB ${FOLDSEEK_YEAST_DIR}/tmp

# Search the query structure against the yeast database
foldseek easy-search \
    -k 3 \
    --exhaustive-search 1 \
    --remove-tmp-files 0 \
    --format-mode 4 \
    --format-output query,target,theader,taxid,taxname,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,qcov,tcov,evalue,ttmscore,alntmscore,bits,rmsd,prob \
    ${INPUT_DIR}/P39720_OAF1_MHD.pdb \
    ${FOLDSEEK_YEAST_DIR}/yeastDB \
    ${OUTPUT_DIR}/P39720_k3_exhaustive.tsv \
    ${OUTPUT_DIR}/tmp_P39720_k3_exhaustive

Foldseek Output (for bugs)

MMseqs Version:                 9.427df8a
Seq. id. threshold              0
Coverage threshold              0
Coverage mode                   0
Max reject                      2147483647
Max accept                      2147483647
Add backtrace                   false
TMscore threshold               0
TMalign hit order               0
TMalign fast                    1
Preload mode                    0
Threads                         192
Verbosity                       3
LDDT threshold                  0
Sort by structure bit score     1
Alignment type                  2
Exact TMscore                   0
Substitution matrix             aa:3di.out,nucl:3di.out
Alignment mode                  3
Alignment mode                  0
E-value threshold               10
Min alignment length            0
Seq. id. mode                   0
Alternative alignments          0
Max sequence length             65535
Compositional bias              1
Compositional bias              1
Gap open cost                   aa:10,nucl:10
Gap extension cost              aa:1,nucl:1
Compressed                      0
Seed substitution matrix        aa:3di.out,nucl:3di.out
Sensitivity                     9.5
k-mer length                    3
Target search mode              0
k-score                         seq:2147483647,prof:2147483647
Max results per query           1000
Split database                  0
Split mode                      2
Split memory limit              0
Diagonal scoring                true
Exact k-mer matching            0
Mask residues                   0
Mask residues probability       0.99995
Mask lower case residues        1
Minimum diagonal score          30
Selected taxa                   
Spaced k-mers                   1
Spaced k-mer pattern            
Local temporary path            
Exhaustive search mode          true
Prefilter mode                  0
Search iterations               1
Remove temporary files          false
MPI runner                      
Force restart with latest tmp   false
Cluster search                  0
Path to ProstT5                 
Chain name mode                 0
Write mapping file              0
Mask b-factor threshold         0
Coord store mode                2
Write lookup file               1
Input format                    0
File Inclusion Regex            .*
File Exclusion Regex            ^$
Alignment format                4
Format alignment output         query,target,theader,taxid,taxname,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,qcov,tcov,evalue,ttmscore,alntmscore,bits,rmsd,prob
Database output                 false
Greedy best hits                false

Alignment backtraces will be computed, since they were requested by output format.
Input taxonomy database "/tempor/kchennen/CEGAL/data/1_interim/foldseek_yeast/yeastDB" is missing files:
- /tempor/kchennen/CEGAL/data/1_interim/foldseek_yeast/yeastDB_nodes.dmp
- /tempor/kchennen/CEGAL/data/1_interim/foldseek_yeast/yeastDB_names.dmp
- /tempor/kchennen/CEGAL/data/1_interim/foldseek_yeast/yeastDB_merged.dmp

milot-mirdita commented 5 months ago

Automatic taxid extraction works only for mmCIF files that have one of the following three fields somewhere:

Because PDB files don't commonly contain (easily extractable) taxonomy information, we don't try to read it from them.

Annotating the database by hand is possible, but a bit more involved. See the following MMseqs2 wiki section: https://github.com/soedinglab/MMseqs2/wiki#create-a-seqtaxdb-by-manual-annotation-of-a-sequence-database
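
For a database where every entry belongs to the same organism, the manual annotation could look roughly like the sketch below. It is not taken from the thread and rests on several assumptions: that the `.lookup` file written by createdb is tab-separated with the entry name in the second column, that the mapping file expected by createtaxdb has the form `accession<TAB>taxid`, that foldseek exposes MMseqs2's `createtaxdb` module with the same options, and that 559292 (S. cerevisiae S288C) is the right taxon for these files. The helper name `make_taxid_mapping` and the `ncbi-taxdump` and `taxidmapping` paths are hypothetical.

```shell
#!/bin/sh
# Sketch: write "accession<TAB>taxid" lines for a database in which all
# entries share one NCBI taxon ID.
# Assumption: column 2 of the .lookup file holds the entry name.
make_taxid_mapping() {
    lookup_file=$1   # e.g. yeastDB.lookup
    taxid=$2         # NCBI taxon ID assigned to every entry
    awk -F '\t' -v taxid="$taxid" '{ print $2 "\t" taxid }' "$lookup_file"
}

# Hypothetical usage (taxid and paths are assumptions, not from the thread;
# ncbi-taxdump is a directory holding the unpacked NCBI taxdump.tar.gz):
#   make_taxid_mapping "${FOLDSEEK_YEAST_DIR}/yeastDB.lookup" 559292 > taxidmapping
#   foldseek createtaxdb "${FOLDSEEK_YEAST_DIR}/yeastDB" tmp \
#       --ncbi-tax-dump ncbi-taxdump --tax-mapping-file taxidmapping
```

If this works as assumed, the yeastDB_nodes.dmp, yeastDB_names.dmp and yeastDB_merged.dmp files named in the error message should appear next to the database. It is worth checking that the names in the mapping file match the target names reported by the search.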