steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

Access to states #15

Closed (phiweger closed 1 year ago)

phiweger commented 2 years ago

If I understand correctly, the VQ-VAE used by foldseek translates each amino acid into one of 20 "states". Do we have access to these, i.e. is it possible to get the "state sequence"? Like:

AVGAI -> states 1, 5, 7, 1, 13

Thanks!

martin-steinegger commented 2 years ago

These sequences should be stored in the database with the ending _ss. You can convert them to a FASTA file using:

foldseek convert2fasta db_ss db_ss.fasta

mvankem commented 2 years ago

This gives an error at the moment. The workaround is to temporarily copy db_ss in place of db:

mv db tmp
cp db_ss db
foldseek convert2fasta db db_ss.fasta
mv tmp db

phiweger commented 2 years ago

I get the following error:

foldseek convert2fasta queryDB_ss queryDB_ss.fasta

convert2fasta queryDB_ss queryDB_ss.fasta

MMseqs Version: a4983ce31e6e006a29d9d9330ce9f826cd555d3e
Use header DB   false
Verbosity       3

Database queryDB_ss needs header information

milot-mirdita commented 2 years ago

A better workaround should be:

foldseek lndb queryDB_h queryDB_ss_h
foldseek convert2fasta queryDB_ss queryDB_ss.fasta

Note, though, that we might not keep the state-to-letter assignments stable between releases.
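
Once the 3Di FASTA exists, the residue-to-state pairing asked about in the original question can be sketched with awk. This is only an illustration under several assumptions: the amino acid FASTA was produced separately (for example with foldseek convert2fasta queryDB queryDB.fasta), both files list the same entries in the same order with one sequence line per entry, and the function name pair_states is made up here. It prints 3Di letters rather than numeric state indices, since the letter assignment may change between releases.

```shell
# pair_states AA_FASTA SS_FASTA   (hypothetical helper, not part of foldseek)
# Prints "entry<TAB>position<TAB>residue<TAB>3Di letter" for every residue.
pair_states() {
    awk '
        # Skip blank lines in either file.
        NF == 0 { next }
        # First file: remember the amino acid sequence of each entry, in order.
        NR == FNR {
            if ($0 !~ /^>/) aa[++n] = $0
            next
        }
        # Second file: header lines give the entry name.
        /^>/ { name = substr($1, 2); next }
        # Second file: sequence lines hold the 3Di letters.
        {
            k++
            if (length($0) != length(aa[k])) {
                print "length mismatch for " name > "/dev/stderr"
                exit 1
            }
            for (i = 1; i <= length($0); i++)
                printf "%s\t%d\t%s\t%s\n", name, i, substr(aa[k], i, 1), substr($0, i, 1)
        }
    ' "$1" "$2"
}
```

Usage would look like pair_states queryDB.fasta queryDB_ss.fasta, giving one line per residue.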

sophialeeman commented 2 years ago

Hi! I am running into some issues when trying to find the _ss file. When I run the easy-search command, I believe the command deletes this database. Could you advise me on a different command to run in order to receive this database as an output? I am solely interested in converting pdb files to 3Di sequences.

martin-steinegger commented 2 years ago

You can just use foldseek createdb pdbFolder outputDb to generate the _ss files.

sophialeeman commented 2 years ago

Thank you so much! It works! This was very helpful!
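
Putting the commands from this thread together, going from a folder of PDB files to a 3Di FASTA could look roughly like the sketch below. It assumes foldseek is on the PATH; the function name pdb_to_3di and the argument names are hypothetical, and the lndb step is only run when the _ss database has no header files yet.

```shell
# pdb_to_3di STRUCTURE_DIR DB OUT_FASTA   (hypothetical helper)
pdb_to_3di() {
    structures="$1"
    db="$2"
    out="$3"
    # Build the database; this also writes the 3Di states to ${db}_ss.
    foldseek createdb "$structures" "$db" || return 1
    # Give the _ss database header information if it has none yet.
    if [ ! -e "${db}_ss_h" ]; then
        foldseek lndb "${db}_h" "${db}_ss_h" || return 1
    fi
    # Write the 3Di state sequences as FASTA.
    foldseek convert2fasta "${db}_ss" "$out"
}
```

For example: pdb_to_3di pdbFolder outputDb outputDb_ss.fasta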

phiweger commented 2 years ago

Is it possible to reverse this, i.e., to generate the 3D structure from the states? As VAEs are generative models, "something" should come out when you decode the states. Is this possible through the current API?

milot-mirdita commented 2 years ago

At least for the foldseek databases, we store the C-alpha coordinates and also implement PULCHRA within foldseek. So you can get a reasonable backbone back from each foldseek database entry.

I don’t think we have looked into getting a structure back out of the 3Di states yet though.

martin-steinegger commented 1 year ago

We have recently added the --format-mode 5 option to foldseek, which generates PDB files with all C-alpha atoms superimposed based on the aligned coordinates.
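
As a rough sketch of how this option might be used: the comment above does not say which subcommand takes it, so the example assumes it is passed to easy-search. The wrapper name superpose_hits and its argument names are hypothetical.

```shell
# superpose_hits QUERY TARGET_DB RESULT TMP_DIR   (hypothetical wrapper)
# Assumption: --format-mode 5 is accepted by easy-search.
superpose_hits() {
    # Search QUERY against TARGET_DB and request superposed C-alpha PDB output.
    foldseek easy-search "$1" "$2" "$3" "$4" --format-mode 5
}
```

For example: superpose_hits query.pdb targetDB result tmp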