How do I create the FASST file described in the Main pipeline?

swanss / peptide_design

This repository contains code for the paper: "Tertiary motifs as building blocks for the design of protein-binding peptides"


How do I create the FASST file described in the Main pipeline? #4

Closed blacktanktop closed 2 years ago

blacktanktop commented 2 years ago

Hi! I want to run the main pipeline, but I don't know how to create the FASST file described in the main pipeline. How do I create the singlechain_22188_sim30.db or multichain_23643_sim50.db described in the config file in peptide_design/example/input_files/? Could you please give me the inputs and the procedure?

swanss commented 2 years ago

Hi! You can create your own FASST database file using fasstDB, a program in Mosaist. It requires a list of the PDB files that you'd like to include in the database; ideally these have been pruned for homology. In the supplement we provide the exact list of structures that we used in the single and multichain databases. If you go this route, be sure to use the --s/--c arguments when building the DB. Within the next few days I can add a section to the readme explaining how to do this, step-by-step.

Since that's pretty involved, I'm also going to see about making the databases that we used in the paper available for download. I'm currently looking into the GitHub LFS option. I'll let you know when the files are ready for download.

Thanks for bringing this and the compilation error to my attention!
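
As a rough illustration of the first step described above (collecting the PDB files into a list for fasstDB), here is a minimal Python sketch. The one-path-per-line layout, the `*.pdb` glob, and the function name `write_pdb_list` are assumptions made for illustration and are not taken from the Mosaist documentation; check fasstDB's own usage message for the list format and options it actually expects.

```python
from pathlib import Path


def write_pdb_list(pdb_dir, list_path):
    """Write the absolute path of every *.pdb file in pdb_dir to list_path.

    Assumption: the list is a plain text file with one path per line.
    Returns the number of paths written.
    """
    # Sort so the list (and any database built from it) is reproducible.
    pdb_files = sorted(p.resolve() for p in Path(pdb_dir).glob("*.pdb"))
    Path(list_path).write_text("".join(f"{p}\n" for p in pdb_files))
    return len(pdb_files)


# Example (hypothetical directory and file names):
# n = write_pdb_list("pruned_structures", "pdb_list.txt")
```

The resulting file would then be passed to fasstDB together with the --s/--c arguments mentioned above.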

blacktanktop commented 2 years ago

Thanks for the reply. I reread the supplementary material more carefully and can confirm that Mosaist has a command called fasstDB for creating a FASST database. I hadn't read it carefully enough the first time.

However, the lists of PDBs used in the paper, which are probably needed to create the FASST databases, seem to be in the following tables, and I cannot find those on the paper's web page. (Tables S1-S3 are there, but Tables S4-S7 do not appear to have been uploaded.)

Table S6: S6_SingleChainDatabase2019-01-22.txt
Table S7: S7_BiologicalUnitsDatabase2019-01-22.txt

It certainly looks a bit difficult to create this FASST database, so I look forward to downloading the database you used in your paper!

swanss commented 2 years ago

Huh, well thanks for pointing that out, I'll reach out to Protein Science and have them add the missing supplementary tables.

I was able to upload the smallest database. You can find it at testfiles/singlechain_22188_sim30_STRIDE.db. This should work well for generating seeds. It will also work for scoring the interface, but it's not quite ideal for that. I'll need to find another way to host larger files to make the multichain database available too.

blacktanktop commented 2 years ago

I would like to perform scoring, but the config in the example points to the multichainDB. Does the meaning of the score change if the singlechainDB is used instead (sorry, I haven't fully understood the paper)? If it has to be the multichainDB, would it be possible to upload the multichainDB somewhere, as you did with the singlechainDB?

https://github.com/swanss/peptide_design/blob/c77b3ebc2ea534b307c291fee732f4215ccf6143/example/6_scoreStructures/run_scoreStructures.sh#L17

swanss commented 2 years ago

Hi, I was able to upload the singlechain and multichain DB files to Zenodo. Would you mind downloading both and seeing if they work for you? https://zenodo.org/record/6569429

While I haven't benchmarked it, I do suspect that it's better to score interfaces with the multichain DB. The singlechain DB splits biological units apart, which could have an influence on statistics.
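
Because the database files are large downloads, it may be worth confirming that they arrived intact before pointing the config at them. Zenodo record pages list an MD5 checksum for each file, so a minimal sketch of that check in Python could look like the following. The function name `md5_of_file` and the local file name in the usage comment are assumptions for illustration; the expected checksum has to be copied from the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that a multi-gigabyte database does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (hypothetical local file name; copy the expected value from Zenodo):
# assert md5_of_file("multichain_23643_sim50.db") == "<md5 shown on the record page>"
```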

emarcos commented 2 years ago

Hi Sebastian! Really interesting paper. Following up on the previous question: if I understood correctly, the results of the paper are based on the singlechainDB. So I was wondering why the run_scoreStructures.sh example looks for the multichainDB file. Do you expect it to work better? I'm asking just to make sure I test your code the proper way.

Thanks!

swanss commented 2 years ago

Hi Enrique!

The singlechainDB is used for generating "interface seeds" which are combined to construct the peptide backbones. When it comes to scoring the interfaces, we opted to use the multichainDB. I just looked back at the paper, and this is only really mentioned in the methods, so I can see how that is confusing.

The TERM interface score works by comparing the probability of the amino acid on the surface of the target protein in the context of a "pair fragment" describing the interface with its probability in the context of a "self fragment" describing just the surface of the protein (which acts as the reference state). The benefit of using the multichainDB is that it will include more interface structures that could be matches to the peptide-protein interface that is being scored. The degree to which this helps is not clear, since I haven't benchmarked how the score changes when using the singlechainDB.
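
To make the idea of comparing the two probabilities concrete, here is a toy Python sketch. It is not the scoring code from peptide_design and not the formula from the paper: the log-ratio form, the pseudocount, and the names `aa_probability` and `log_ratio_score` are all assumptions chosen for illustration. The inputs stand in for amino acid counts observed among database matches to the pair fragment and to the self fragment.

```python
import math

N_AMINO_ACIDS = 20  # size of the standard amino acid alphabet


def aa_probability(counts, aa, pseudocount=1.0):
    """Estimate P(aa) from match counts, smoothed with a pseudocount
    (an assumption here) so that unseen residues do not get probability 0."""
    total = sum(counts.values()) + pseudocount * N_AMINO_ACIDS
    return (counts.get(aa, 0) + pseudocount) / total


def log_ratio_score(pair_counts, self_counts, aa, pseudocount=1.0):
    """Log-ratio of P(aa | pair fragment) to P(aa | self fragment).

    Positive values mean the residue is seen more often in the interface
    context than in the reference (self) context.
    """
    p_pair = aa_probability(pair_counts, aa, pseudocount)
    p_self = aa_probability(self_counts, aa, pseudocount)
    return math.log(p_pair / p_self)


# Toy counts (made up): leucine is enriched among pair-fragment matches.
pair_matches = {"L": 8, "A": 2}
self_matches = {"L": 2, "A": 8}
score = log_ratio_score(pair_matches, self_matches, "L")
```

In a scheme like this, a database with more interface structures would mainly affect the pair-fragment counts, which is where sparse matches would make the estimate noisiest.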

emarcos commented 2 years ago

Thanks for the clarification! Just downloaded the multichainDB and will test it on our system.

swanss commented 2 years ago

Seems like the databases I uploaded to Zenodo are working. Feel free to reopen this discussion if you run into any issues.