steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

Custom 3Di alphabet #196

Closed by awfderry 8 months ago

awfderry commented 8 months ago

Hi, I'm interested in creating a Foldseek DB for a custom 3Di alphabet. Is this possible?

For example, I can create a FASTA file for the custom sequences, analogous to the one that convert2fasta produces for the original 3Di alphabet. However, I'm not sure how to either (1) convert it back to DB format while maintaining compatibility with the AA and CA databases, or (2) create a new DB from scratch given a mapping from each PDB file to its custom-alphabet sequence.

Thank you!

milot-mirdita commented 8 months ago

You can build a completely custom foldseek database following these instructions: https://github.com/steineggerlab/foldseek/issues/155#issuecomment-1676309878

With that approach, you generate matching TSV files for the AA sequences, the 3Di sequences, and the headers, then convert them into a Foldseek database.
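As a rough illustration of the "matching TSV files" step, here is a minimal Python sketch. It assumes each TSV row is `key<TAB>value`, that the same numeric key ties together the AA sequence, the custom-alphabet sequence, and the header of one entry, and that keys start at 0. The function name `write_foldseek_tsvs` and the output file names are hypothetical, and the exact layout expected by the conversion step should be checked against the linked comment, which is not reproduced here.

```python
from pathlib import Path


def write_foldseek_tsvs(entries, out_dir):
    """Write aa.tsv, 3di.tsv and header.tsv with matching keys.

    entries: iterable of (header, aa_seq, custom_seq) tuples.
    Assumption: one `key<TAB>value` row per entry, keys counted from 0.
    Returns a dict mapping "aa", "3di", "header" to the written paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    names = ("aa", "3di", "header")
    rows = {name: [] for name in names}

    for key, (header, aa_seq, custom_seq) in enumerate(entries):
        # The structural sequence should have one state per residue.
        if len(aa_seq) != len(custom_seq):
            raise ValueError(
                f"entry {key} ({header!r}): AA length {len(aa_seq)} "
                f"!= custom-alphabet length {len(custom_seq)}"
            )
        # Tabs or newlines inside a field would corrupt the TSV layout.
        for field in (header, aa_seq, custom_seq):
            if "\t" in field or "\n" in field:
                raise ValueError(f"entry {key}: field contains tab or newline")
        rows["aa"].append(f"{key}\t{aa_seq}\n")
        rows["3di"].append(f"{key}\t{custom_seq}\n")
        rows["header"].append(f"{key}\t{header}\n")

    paths = {name: out_dir / f"{name}.tsv" for name in names}
    for name, path in paths.items():
        path.write_text("".join(rows[name]))
    return paths
```

Calling `write_foldseek_tsvs([("protA", "MKV", "DVL")], "custom_db_tsv")` would write three one-row files into `custom_db_tsv/`. Checking that the AA and custom-alphabet sequences have equal length catches misaligned inputs before any database is built.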

In case you want to fully retrain 3Di, we should have all required scripts in the following repo: https://github.com/steineggerlab/foldseek-analysis

awfderry commented 8 months ago

Thanks Milot, this is exactly what I'm looking for!