steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0
780 stars 99 forks source link

Expanding cluster searches with expandaln #177

Closed srom closed 1 year ago

srom commented 1 year ago

Hi,

My environment does not have enough RAM to run queries against the full afdb so I have been using afdb50 instead. Searching against this database works perfectly well. However, after the initial search against this cluster database, I would like to expand the search to members of each cluster in my result set. I have not been successful so far.

My idea was to to use foldseek expandaln for that purpose by following the instructions from the mmseqs2 wiki ("Expand cluster searches"). To that end, I downloaded afdb50clusearch.tar.gz (as per this recent commit https://github.com/steineggerlab/foldseek/commit/93ad1d44a31fd99a2f51d48f77741fd7fe0a1865) as I thought it may include all the necessary files.

Script to reproduce:

foldseek createdb query.pdb query_db --threads 16

foldseek search \
    query_db \
    afdb50 \
    results_db \
    tmp \
    -a 1 \
    --threads 16 \
    --prefilter-mode 1 \
    --cov-mode 2

foldseek expandaln \
    query_db \
        afdb50 \
    results_db \
    afdb50_clu \   # this is probably the issue
    results_expanded \
    --threads 16 \
    --cov-mode 2

(Note: parameter -a in foldseek search was inspired by https://github.com/soedinglab/MMseqs2/issues/650 as previously I got a different error related to missing backtrace in alignment.)

The first two commands run fine. However foldseek expandaln returns the follwing error:

Invalid alignment result record.

I suspect this is because I do not have the proper _aln file. I am using afdb50_clu as I thought it might fit the bill but looks like that's not it.

Let me know if you need more context. Thanks a lot for your help & for building this great software.

Environment

milot-mirdita commented 1 year ago

Please redownload the afdb and the latest precompiled foldseek binaries (or compile from source) and use the new --cluster-search 1 parameter. This one already does exactly what you want.

srom commented 1 year ago

Thanks for the prompt reply!

I am using the latest pre-compiled binaries (7 – 04e0ec8) and indeed I can see the option --cluster-search.

I downloaded the latest afdb and I ran the following command:

foldseek easy-search query.pdb afdb results.tsv tmp
    --threads 16 \
    --prefilter-mode 1 \
    --cov-mode 2 \
    --cluster-search 1 \
    --format-mode 4 \
        --format-output query,target,taxid,taxname,taxlineage,fident,tstart,tend,evalue,bits,qtmscore,rmsd

(using 16 cores and 512 GB RAM)

I got an error saying that afdb_seq.dbtype was required when running with cluster search.

At this point I tried two things:

First, I tried running same query against afdb50 instead. I received error:

Invalid database read for database data file=/path/afdb50_h, database index=/path/afdb50_h.index
getData: local id (4294967295) >= db size (53665860)

Second, I tried to copy afdb50_seq (as well as _clu,_seq_ca and _seq_ss) to the same folder as the afdb database and renamed the files appropriately (e.g. afdb_seq). I ran the initial query again and got the error:

Invalid key 170975519 in entry 0.

Any idea what I am doing wrong?

srom commented 1 year ago

Following up on this issue, looks like my problem is very similar to issue https://github.com/steineggerlab/foldseek/issues/179. Poster noted that running the job without --format-mode 3 helped, so I tired running mine with no --format-mode option:

foldseek easy-search query.pdb afdb50 results.tsv tmp
    --threads 16 \
    --prefilter-mode 1 \
    --cov-mode 2 \
    --cluster-search 1

However I still run into the same issue with convertalis:

stdout:

easy-search query.pdb afdb50 results.tsv tmp --threads 16 --prefilter-mode 1 --cov-mode 2 --cluster-search 1 

MMseqs Version:                 7.04e0ec8
Seq. id. threshold              0
Coverage threshold              0
Coverage mode                   2
Max reject                      2147483647
Max accept                      2147483647
Add backtrace                   false
TMscore threshold               0
TMalign hit order               0
TMalign fast                    1
Preload mode                    0
Threads                         16
Verbosity                       3
LDDT threshold                  0
Sort by structure bit score     1
Substitution matrix             aa:3di.out,nucl:3di.out
Alignment mode                  3
Alignment mode                  0
E-value threshold               10
Min alignment length            0
Seq. id. mode                   0
Alternative alignments          0
Max sequence length             65535
Compositional bias              1
Compositional bias              1
Gap open cost                   aa:10,nucl:10
Gap extension cost              aa:1,nucl:1
Compressed                      0
Seed substitution matrix        aa:3di.out,nucl:3di.out
Sensitivity                     9.5
k-mer length                    6
k-score                         seq:2147483647,prof:2147483647
Max results per query           1000
Split database                  0
Split mode                      2
Split memory limit              0
Diagonal scoring                true
Exact k-mer matching            0
Mask residues                   0
Mask residues probability       0.99995
Mask lower case residues        1
Minimum diagonal score          30
Selected taxa                   
Spaced k-mers                   1
Spaced k-mer pattern            
Local temporary path            
Exhaustive search mode          false
Prefilter mode                  1
Search iterations               1
Alignment type                  2
Remove temporary files          true
MPI runner                      
Force restart with latest tmp   false
Cluster search                  1
Chain name mode                 0
Write mapping file              0
Mask b-factor threshold         0
Coord store mode                2
Write lookup file               1
Tar Inclusion Regex             .*
Tar Exclusion Regex             ^$
Alignment format                0
Format alignment output         query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output                 false
Greedy best hits                false

(...)

convertalis /tmp/93058698540636324/query afdb50 /tmp/93058698540636324/result results.tsv --sub-mat 'aa:3di.out,nucl:3di.out' --format-mode 0 --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits --translation-table 1 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --db-output 0 --db-load-mode 0 --search-type 0 --threads 16 --compressed 0 -v 3 

[=================================================================] 1 0s 0ms
Error: Convert Alignments died

stderr:

Invalid database read for database data file=afdb50_h, database index=afdb50_h.index
getData: local id (4294967295) >= db size (53665860)
srom commented 1 year ago

I have noticed that the version available on conda (also the latest one on github: 7 — 04e0ec8) is not the same as the latest published version on https://mmseqs.com/foldseek (311845db4cd8c4de836f8e8cc87b2cae41352161).

I have installed version 311845db4cd8c4de836f8e8cc87b2cae41352161 instead and now the following command works perfectly fine:

foldseek easy-search query.pdb afdb50 results.tsv tmp
    --threads 16 \
    --prefilter-mode 1 \
    --cov-mode 2 \
    --cluster-search 1 \
    --format-mode 4

Happy days!

--

However, not unlike issue https://github.com/steineggerlab/foldseek/issues/179, the above command fails if I try to specify a custom list of output columns.

e.g. if I add --format-output query,target,taxid,taxname,taxlineage,fident,alnlen,tstart,tend,tlen,evalue,bits,qtmscore,rmsd the command fails at the convertalis step. Error in stderr indicates that database file afdb50_seq_seq_ca is missing. Sounds like an extra _seq is being appended somewhere.

I'll close this issue nonetheless, as the bug is covered by https://github.com/steineggerlab/foldseek/issues/179.

martin-steinegger commented 1 year ago

We just released a new version of foldseek https://github.com/steineggerlab/foldseek/releases/tag/8-ef4e960 and are in the process to update conda.

srom commented 1 year ago

I can confirm that the latest version fixes the problem. Thanks!