steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0
814 stars 100 forks source link

Choosing subset of entries using custom list of IDs for foldseek createDB step #97

Open anandksrao opened 1 year ago

anandksrao commented 1 year ago

Context:

I want to search for structure matches ONLY for "representative" proteins in the proteome of my species of interest, and I would rather build the searchable database using just these IDs, rather than parse foldseek easy-search results - in other words, avoid post-processing.

Question 1:

After downloading species-specific sharded tar files of AlphaFold2 database, using these instructions, is it possible to supply to foldseek, during the createdb step, a CUSTOM LIST of UniProt IDs that will allow the database to be built ONLY for these user supplied IDs?

The help menu for foldseek createdb has an option each to include or exclude file names using regex, but is it possible to include or exclude file names based on an entire custom list of names that cannot be summarized via regex?

Answering my OWN question:

See earlier post from 2023 Jan 20 at https://github.com/steineggerlab/foldseek/issues/77

Specifically the syntax in this post by user endixk

awk 'NR == FNR {f[$1] = $1; next} $2 in f {print $1}' subset.txt afdb.lookup > subset.lookup
foldseek createsubdb subset.lookup afdb afdb_subset
foldseek createsubdb subset.lookup afdb_ss afdb_subset_ss
foldseek createsubdb subset.lookup afdb_ca afdb_subset_ca
rm subset.lookup

For my goals

Step 1: foldseek createdb
Step 2: foldseek createsubdb
Step 3: foldseek easy-search / foldseek search

Question 2:

In any foldseek database created using foldseek createdb, how can I count the number of entries (individual proteins) in that database? Is the number of entries used in createdb = number of files in the *.source file?

Expected Behavior

not applicable, all steps execute OK

Current Behavior

not applicable, all steps execute OK

Steps to Reproduce

not applicable, all steps execute OK

Foldseek Output (for bugs)

not applicable, all steps execute OK

Context

Providing context helps us come up with a solution and improve our documentation for the future.

Environment

kad-ecoli commented 1 year ago

For Question 2, suppose the target database is called targetDB, the number of database proteins can be obtained by

cat targetDB.lookup |wc -l
anandksrao commented 1 year ago

Thanks for your reply, Chengxin Zhang.

Observation 1

I notice that in a foldseek database built using createdb using either *.source file (as reported by me) or using the *.lookup file (as suggested by you) yield the same result in terms of number of lines in the file

Observation 2

Download ARATH AF2 database from this AF2 database link This link is available on the AF2 downloads home page for multiple species at this link After completing the foldseek createdb step for this AF2 database, peeking into the *.lookup file as suggested by you

$ head ARATH_foldseek/ARATH_AF2_foldseekDB_2023Mar31.lookup
0   AF-A0A0A7EPL0-F1-model_v4.cif.gz    0
1   AF-A0A140JWM8-F1-model_v4.pdb.gz    1
2   AF-A0A0A7EPL0-F1-model_v4.pdb.gz    2
3   AF-A0A140JWM8-F1-model_v4.cif.gz    3
4   AF-A0A178U9T0-F1-model_v4.cif.gz    4
5   AF-A0A178U7J2-F1-model_v4.pdb.gz    7
6   AF-A0A178U9T0-F1-model_v4.pdb.gz    5
7   AF-A0A178UFC4-F1-model_v4.pdb.gz    6
8   AF-A0A178UFC4-F1-model_v4.cif.gz    8
9   AF-A0A178ULY3-F1-model_v4.pdb.gz    9

Here you can see that lines 0 and 2 are both for same UniProt ID A0A0A7EPL0, one in PDB another in CIF format. Likewise, lines 1 and 3 both correspond to UniProt ID A0A140JWM8, one in PDB another in CIF format...

Basically, in both the .source and .lookup files, there are exactly equal number of PDB versus CIF files in gz format.

$ cat ARATH_foldseek/ARATH_AF2_foldseekDB_2023Mar31.lookup | grep pdb | wc -l
   27434
$ cat ARATH_foldseek/ARATH_AF2_foldseekDB_2023Mar31.lookup | grep cif | wc -l
   27434
$ cat ARATH_foldseek/ARATH_AF2_foldseekDB_2023Mar31.source | grep cif | wc -l
   27434
$ cat ARATH_foldseek/ARATH_AF2_foldseekDB_2023Mar31.source | grep pdb | wc -l
   27434

In fact, at https://alphafold.ebi.ac.uk/download#proteomes-section, specifically for Arabidopsis thaliana (as an example), the number of predicted structures is listed as 27434, which is the exact same number of PDBs or CIFs in my built up database for this species.

This means the number of entries = number of lines / 2 for any given database. Do you agree?

Question 1

In the .lookup file generated during the createdb step, what are the 1st and 3rd columns for? And why do contents of columns 1 and 3 not match for all rows of this file (example above)?

Question 2

Is there a foldseek utility that can "extract" structure info in PDB or CIF format, for just the start - stop alignment coordinates returned by foldseek searchor foldseek easy-search, as a subset of the full length structure prediction? If yes, which foldseek tool? If not, are there non-foldseek tools that may be used for this?

Hopefully someone from the development team or experienced foldseek users to answer these questions. Thanks in advance!

diegozea commented 1 year ago

Hi! I also wonder about the answer to question 2. I have found in https://github.com/steineggerlab/foldseek/issues/68 that --format-mode 5 allows to output the target structures in PDB format. However, --format-mode 5 return the whole chain and not only the aligned region. This is ok, but the problem is that it does not output the table to know the coordinates to cut the structures. Therefore, at the moment, I am running Foldseek two times; one to get the coordinates and the other to get the structures. Is there a better way? Thanks in advance, Cheers

milot-mirdita commented 10 months ago

You have to call createdb -> search and then convertalis twice with the two different output modes you want (superposed structures and BLAST-tab).

diegozea commented 9 months ago

Thanks @milot-mirdita! I am doing that right now 😄

gieses commented 7 months ago

Hi! I also wonder about the answer to question 2. I have found in #68 that --format-mode 5 allows to output the target structures in PDB format. However, --format-mode 5 return the whole chain and not only the aligned region. This is ok, but the problem is that it does not output the table to know the coordinates to cut the structures. Therefore, at the moment, I am running Foldseek two times; one to get the coordinates and the other to get the structures. Is there a better way? Thanks in advance, Cheers

with coordinates you mean the alignment start and end?

diegozea commented 7 months ago

Yes, exactly. I wanted the start and end position in the target structure to extract the aligned part.