Closed srom closed 1 year ago
Please redownload the afdb and the latest precompiled foldseek binaries (or compile from source) and use the new --cluster-search 1
parameter. This one already does exactly what you want.
Thanks for the prompt reply!
I am using the latest pre-compiled binaries (7 – 04e0ec8) and indeed I can see the option --cluster-search
.
I downloaded the latest afdb
and I ran the following command:
foldseek easy-search query.pdb afdb results.tsv tmp
--threads 16 \
--prefilter-mode 1 \
--cov-mode 2 \
--cluster-search 1 \
--format-mode 4 \
--format-output query,target,taxid,taxname,taxlineage,fident,tstart,tend,evalue,bits,qtmscore,rmsd
(using 16 cores and 512 GB RAM)
I got an error saying that afdb_seq.dbtype
was required when running with cluster search.
At this point I tried two things:
First, I tried running same query against afdb50
instead. I received error:
Invalid database read for database data file=/path/afdb50_h, database index=/path/afdb50_h.index
getData: local id (4294967295) >= db size (53665860)
Second, I tried to copy afdb50_seq
(as well as _clu
,_seq_ca
and _seq_ss
) to the same folder as the afdb
database and renamed the files appropriately (e.g. afdb_seq
). I ran the initial query again and got the error:
Invalid key 170975519 in entry 0.
Any idea what I am doing wrong?
Following up on this issue, looks like my problem is very similar to issue https://github.com/steineggerlab/foldseek/issues/179. Poster noted that running the job without --format-mode 3
helped, so I tired running mine with no --format-mode
option:
foldseek easy-search query.pdb afdb50 results.tsv tmp
--threads 16 \
--prefilter-mode 1 \
--cov-mode 2 \
--cluster-search 1
However I still run into the same issue with convertalis
:
stdout:
easy-search query.pdb afdb50 results.tsv tmp --threads 16 --prefilter-mode 1 --cov-mode 2 --cluster-search 1
MMseqs Version: 7.04e0ec8
Seq. id. threshold 0
Coverage threshold 0
Coverage mode 2
Max reject 2147483647
Max accept 2147483647
Add backtrace false
TMscore threshold 0
TMalign hit order 0
TMalign fast 1
Preload mode 0
Threads 16
Verbosity 3
LDDT threshold 0
Sort by structure bit score 1
Substitution matrix aa:3di.out,nucl:3di.out
Alignment mode 3
Alignment mode 0
E-value threshold 10
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Max sequence length 65535
Compositional bias 1
Compositional bias 1
Gap open cost aa:10,nucl:10
Gap extension cost aa:1,nucl:1
Compressed 0
Seed substitution matrix aa:3di.out,nucl:3di.out
Sensitivity 9.5
k-mer length 6
k-score seq:2147483647,prof:2147483647
Max results per query 1000
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 0
Mask residues probability 0.99995
Mask lower case residues 1
Minimum diagonal score 30
Selected taxa
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Exhaustive search mode false
Prefilter mode 1
Search iterations 1
Alignment type 2
Remove temporary files true
MPI runner
Force restart with latest tmp false
Cluster search 1
Chain name mode 0
Write mapping file 0
Mask b-factor threshold 0
Coord store mode 2
Write lookup file 1
Tar Inclusion Regex .*
Tar Exclusion Regex ^$
Alignment format 0
Format alignment output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output false
Greedy best hits false
(...)
convertalis /tmp/93058698540636324/query afdb50 /tmp/93058698540636324/result results.tsv --sub-mat 'aa:3di.out,nucl:3di.out' --format-mode 0 --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits --translation-table 1 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --db-output 0 --db-load-mode 0 --search-type 0 --threads 16 --compressed 0 -v 3
[=================================================================] 1 0s 0ms
Error: Convert Alignments died
stderr:
Invalid database read for database data file=afdb50_h, database index=afdb50_h.index
getData: local id (4294967295) >= db size (53665860)
I have noticed that the version available on conda (also the latest one on github: 7 — 04e0ec8) is not the same as the latest published version on https://mmseqs.com/foldseek (311845db4cd8c4de836f8e8cc87b2cae41352161).
I have installed version 311845db4cd8c4de836f8e8cc87b2cae41352161 instead and now the following command works perfectly fine:
foldseek easy-search query.pdb afdb50 results.tsv tmp
--threads 16 \
--prefilter-mode 1 \
--cov-mode 2 \
--cluster-search 1 \
--format-mode 4
Happy days!
--
However, not unlike issue https://github.com/steineggerlab/foldseek/issues/179, the above command fails if I try to specify a custom list of output columns.
e.g. if I add --format-output query,target,taxid,taxname,taxlineage,fident,alnlen,tstart,tend,tlen,evalue,bits,qtmscore,rmsd
the command fails at the convertalis
step. Error in stderr indicates that database file afdb50_seq_seq_ca
is missing. Sounds like an extra _seq
is being appended somewhere.
I'll close this issue nonetheless, as the bug is covered by https://github.com/steineggerlab/foldseek/issues/179.
We just released a new version of foldseek https://github.com/steineggerlab/foldseek/releases/tag/8-ef4e960 and are in the process to update conda.
I can confirm that the latest version fixes the problem. Thanks!
Hi,
My environment does not have enough RAM to run queries against the full
afdb
so I have been usingafdb50
instead. Searching against this database works perfectly well. However, after the initial search against this cluster database, I would like to expand the search to members of each cluster in my result set. I have not been successful so far.My idea was to to use
foldseek expandaln
for that purpose by following the instructions from the mmseqs2 wiki ("Expand cluster searches"). To that end, I downloadedafdb50clusearch.tar.gz
(as per this recent commit https://github.com/steineggerlab/foldseek/commit/93ad1d44a31fd99a2f51d48f77741fd7fe0a1865) as I thought it may include all the necessary files.Script to reproduce:
(Note: parameter
-a
infoldseek search
was inspired by https://github.com/soedinglab/MMseqs2/issues/650 as previously I got a different error related to missing backtrace in alignment.)The first two commands run fine. However
foldseek expandaln
returns the follwing error:I suspect this is because I do not have the proper
_aln
file. I am usingafdb50_clu
as I thought it might fit the bill but looks like that's not it.Let me know if you need more context. Thanks a lot for your help & for building this great software.
Environment