Open anandksrao opened 1 year ago
For Question 2, suppose the target database is called targetDB, the number of database proteins can be obtained by
cat targetDB.lookup |wc -l
Thanks for your reply, Chengxin Zhang.
I notice that in a foldseek
database built using createdb
using either *.source
file (as reported by me) or using the *.lookup
file (as suggested by you)
yield the same result in terms of number of lines in the file
Download ARATH AF2 database from this AF2 database link
This link is available on the AF2 downloads home page for multiple species at this link
After completing the foldseek createdb
step for this AF2 database, peeking into the *.lookup
file as suggested by you
$ head ARATH_foldseek/ARATH_AF2_foldseekDB_2023Mar31.lookup
0 AF-A0A0A7EPL0-F1-model_v4.cif.gz 0
1 AF-A0A140JWM8-F1-model_v4.pdb.gz 1
2 AF-A0A0A7EPL0-F1-model_v4.pdb.gz 2
3 AF-A0A140JWM8-F1-model_v4.cif.gz 3
4 AF-A0A178U9T0-F1-model_v4.cif.gz 4
5 AF-A0A178U7J2-F1-model_v4.pdb.gz 7
6 AF-A0A178U9T0-F1-model_v4.pdb.gz 5
7 AF-A0A178UFC4-F1-model_v4.pdb.gz 6
8 AF-A0A178UFC4-F1-model_v4.cif.gz 8
9 AF-A0A178ULY3-F1-model_v4.pdb.gz 9
Here you can see that lines 0 and 2 are both for same UniProt ID A0A0A7EPL0, one in PDB another in CIF format. Likewise, lines 1 and 3 both correspond to UniProt ID A0A140JWM8, one in PDB another in CIF format...
Basically, in both the .source and .lookup files, there are exactly equal number of PDB versus CIF files in gz format.
$ cat ARATH_foldseek/ARATH_AF2_foldseekDB_2023Mar31.lookup | grep pdb | wc -l
27434
$ cat ARATH_foldseek/ARATH_AF2_foldseekDB_2023Mar31.lookup | grep cif | wc -l
27434
$ cat ARATH_foldseek/ARATH_AF2_foldseekDB_2023Mar31.source | grep cif | wc -l
27434
$ cat ARATH_foldseek/ARATH_AF2_foldseekDB_2023Mar31.source | grep pdb | wc -l
27434
In fact, at https://alphafold.ebi.ac.uk/download#proteomes-section, specifically for Arabidopsis thaliana (as an example), the number of predicted structures is listed as 27434, which is the exact same number of PDBs or CIFs in my built up database for this species.
This means the number of entries = number of lines / 2
for any given database. Do you agree?
In the .lookup file generated during the createdb step, what are the 1st and 3rd columns for? And why do contents of columns 1 and 3 not match for all rows of this file (example above)?
Is there a foldseek
utility that can "extract" structure info in PDB or CIF format, for just the start - stop alignment coordinates returned by foldseek search
or foldseek easy-search
, as a subset of the full length structure prediction? If yes, which foldseek
tool? If not, are there non-foldseek tools that may be used for this?
Hopefully someone from the development team or experienced foldseek
users to answer these questions. Thanks in advance!
Hi! I also wonder about the answer to question 2. I have found in https://github.com/steineggerlab/foldseek/issues/68 that --format-mode 5
allows to output the target structures in PDB format. However, --format-mode 5
return the whole chain and not only the aligned region. This is ok, but the problem is that it does not output the table to know the coordinates to cut the structures. Therefore, at the moment, I am running Foldseek two times; one to get the coordinates and the other to get the structures. Is there a better way? Thanks in advance, Cheers
You have to call createdb
-> search
and then convertalis
twice with the two different output modes you want (superposed structures and BLAST-tab).
Thanks @milot-mirdita! I am doing that right now 😄
Hi! I also wonder about the answer to question 2. I have found in #68 that
--format-mode 5
allows to output the target structures in PDB format. However,--format-mode 5
return the whole chain and not only the aligned region. This is ok, but the problem is that it does not output the table to know the coordinates to cut the structures. Therefore, at the moment, I am running Foldseek two times; one to get the coordinates and the other to get the structures. Is there a better way? Thanks in advance, Cheers
with coordinates you mean the alignment start and end?
Yes, exactly. I wanted the start and end position in the target structure to extract the aligned part.
Context:
I want to search for structure matches ONLY for "representative" proteins in the proteome of my species of interest, and I would rather build the searchable database using just these IDs, rather than parse
foldseek easy-search
results - in other words, avoid post-processing.Question 1:
After downloading species-specific sharded tar files of AlphaFold2 database, using these instructions, is it possible to supply to
foldseek
, during thecreatedb
step, a CUSTOM LIST of UniProt IDs that will allow the database to be built ONLY for these user supplied IDs?The help menu for
foldseek createdb
has an option each to include or exclude file names using regex, but is it possible to include or exclude file names based on an entire custom list of names that cannot be summarized via regex?Answering my OWN question:
See earlier post from 2023 Jan 20 at https://github.com/steineggerlab/foldseek/issues/77
Specifically the syntax in this post by user endixk
For my goals
Question 2:
In any
foldseek
database created usingfoldseek createdb
, how can I count the number of entries (individual proteins) in that database? Is the number of entries used increatedb
= number of files in the*.source
file?Expected Behavior
not applicable, all steps execute OK
Current Behavior
not applicable, all steps execute OK
Steps to Reproduce
not applicable, all steps execute OK
Foldseek Output (for bugs)
not applicable, all steps execute OK
Context
Providing context helps us come up with a solution and improve our documentation for the future.
Environment