swanss / peptide_design

This repository contains code for the paper: "Tertiary motifs as building blocks for the design of protein-binding peptides"
Other
15 stars 9 forks source link

Averaging structure scores #14

Open proteins247 opened 1 year ago

proteins247 commented 1 year ago

The score structures stage (stage 6) generates structure_scores files and all_fragment_scores/frag_scores files. Per the README and reading some of the closed github issues, the recommendation to score each backbone is by averaging the per residue scores in the structure_scores files.

I wanted to ask, what do you do in the scenario in which some of the residue scores in the structure_scores are 1.79769e+308? i.e. infinity. It seems to be because the number of results (num_results column in the all fragment scores) for a particular peptide residue to protein residue is 0.

One choice is to ignore these infinite-valued scores. Another is to take them as a signal that a peptide residue is not a good match to the start protein and filter out such peptides. I'm not sure.

Another thing: I noticed that the scores you find in the structure_scores files for each peptide residue consists of the sum of the scores in all_fragment_scores for that residue. One peptide residue can have 0, 1, or more than 1 entries in the all fragment scores file, depending on the number of protein residues the peptide residue is scored with. So there are cases where the score in the structure score file for a peptide residue is "infinity," but then that "infinity" is composed of a sum of non-infinity and infinity fragment scores.

For instance, here is a sample from a all_fragment_scores files:

                                      seed seed_res_num  \
218  ../4_samplePaths/path_structures/S...          B13   
222  ../4_samplePaths/path_structures/S...          B13   

    target_res_num  num_results     pair_score       bg_score      seq_score  \
218           A568            5   2.500000e-02   3.688320e-02   3.888800e-01   
222           A569            0  1.797690e+308  1.797690e+308  1.797690e+308   

     contact_score  total_score  
218            0.0      0.38888  
222            NaN          NaN 

Residue 13 is scored twice against protein residues 568 and 569, and the score is infinity in one case. The resultant value in the structure_scores file is infinity. I wonder if I should just drop the scores that are infinity in the all fragment scores file and then recalculate the per residue structure scores?

Finally, I wonder how important scoring is? I did a filter based on the average potential contacts per residue, and I'm down to 400 or so candidate backbones. I wonder if I could just do an RMSD clustering, take a handful of candidate backbones, and try the rest of the design process? I believe I'm designing against a hard target, so I get structure scores like this.

swanss commented 1 year ago

The idea behind scoring is to identify peptide backbones that are compatible with the fixed sequence and structure of the protein target. In this project, we do that by breaking the interface up into peptide-protein fragments and searching those for matches in the PDB. An interface contact will have a good score if the probability of finding the amino acid on the target side is higher than the background (which is just the target fragment alone).

Of course, we can only gather these statistics if there are sufficient matches in the database, so we set the value to +INF to indicate there were not enough matches to get a score. Like you mention, this is evidence that the interface is not forming a common structural motif (e.g. it is not designable) and we chose to discard these from the set of backbone candidates prior to sequence design. If I were you, I would filter out all designs that have a non-designable contact. However, if this is too strict and leaves you with no candidates, you can also try defining a rule where you opt to drop a candidate if it has more than C non-designable contacts and if the backbone design passes the cutoff, drop those contacts prior to averaging over the remaining contacts.

I do believe the scoring is important, as it's the first step where the sequence of the protein target is really taken into account. That being said, protein design/prediction methods have evolved very rapidly over the past year and it might be faster to just design sequence with an ML method (e.g. proteinMPNN) and then evaluate with the rosetta interface analyzer/alphafold.

proteins247 commented 1 year ago

Thanks for the explanation and quick response. I have to think about the next step, then, especially with the mention of proteinMPNN.

Can you clarify about the situation where, in all_fragment_scores, a particular peptide residue has an infinite score with respect to one target protein residue but a non-infinite score with respect to another target protein residue? Is it still the case that the particular peptide residue merits an infinite score?

If I average scores and sort by score, I get 63 non-infinite scores. The lowest scores are 0.36, 0.36, 0.52... One thing that surprises me is that the top few peptide backbones are all alpha helices that are positioned somewhat in the same orientation in the same region of the target protein. Some of the peptides that had the highest number of potential contacts per residue did not score well because of infinite scores for some of their residues.

Looking at the top 20 scoring peptides, 18 are pure alpha helices, while 2 show a helix-coil conformation. Per your advice, I can definitely try to do an RMSD similarity exclusion and design on some of these best-scoring peptides.

htshtshtshts commented 1 month ago

In the sixth step, I also encountered an issue with infinities, where four-fifths of the peptide backbones contained residues with num_results being 0. I noticed that there were also many residues with num_results of 1. If we discard those with 0 matches and retain only those with 1 match, it might seem too strict and potentially lead to the loss of some peptide backbones with good potential, such as those with high scores from other residues. Are there any scripts in the source code for filtering peptide backbones based on their scores (outputting the paths of high-scoring peptide backbones from different clusters, and subsequently incorporating the target protein into their pdb files for dTERMen design)?

Additionally, the contact_score calculated in this step was entirely 0. Does contact_score refer to potential contacts? If not, which step calculates potential contacts? I would like to proceed with the next screening only for peptide backbones with potential contacts greater than 2, and stop further scoring calculations for those that do not meet this criterion. Does the source code support such functionality, and how can it be implemented?

swanss commented 1 month ago

Hi to both of you! Would you be willing to send me your inputs/log/output files for the structure scoring step so that I could inspect them and confirm what's going on? My email is sbstnswnsn@gmail.com

htshtshtshts commented 1 month ago

Thank you very much. I'll organize and send you the input and output files for this step right away. Additionally, I have two more questions:

For the calculation of potential contacts, do I need to add the optional parameter countContacts in the fourth step?

After designing with dTERMen, which outputs an amino acid sequence, is Rosetta Packing used to obtain the full-atom PDB file required for Rosetta relax?

swanss commented 1 month ago

Can you clarify about the situation where, in all_fragment_scores, a particular peptide residue has an infinite score with respect to one target protein residue but a non-infinite score with respect to another target protein residue? Is it still the case that the particular peptide residue merits an infinite score?

@proteins247 My apologies that I missed this specific question! In that case, it means that the geometry of one contact (i.e. a fragment defined from the peptide, target residue in contact and their flanking res) has matches in the structural database, while the other doesn't. My thoughts have changed on this over time. I previously would say to set the residues score to INF, but now I think I would just set the INF contacts to some high "penalty" value (e.g. the max score for any contact) and then average to get the score the peptide.

swanss commented 1 month ago

In the sixth step, I also encountered an issue with infinities, where four-fifths of the peptide backbones contained residues with num_results being 0. I noticed that there were also many residues with num_results of 1. If we discard those with 0 matches and retain only those with 1 match, it might seem too strict and potentially lead to the loss of some peptide backbones with good potential, such as those with high scores from other residues. Are there any scripts in the source code for filtering peptide backbones based on their scores (outputting the paths of high-scoring peptide backbones from different clusters, and subsequently incorporating the target protein into their pdb files for dTERMen design)?

I agree, it's too strict to remove things with zero matches, I would just sort by average score over contacts and take the top N to the next stage. Sorry, I don't have any more scripts for filtering/scoring than what is already in the repo.

Additionally, the contact_score calculated in this step was entirely 0. Does contact_score refer to potential contacts? If not, which step calculates potential contacts? I would like to proceed with the next screening only for peptide backbones with potential contacts greater than 2, and stop further scoring calculations for those that do not meet this criterion. Does the source code support such functionality, and how can it be implemented?

The contact score (it's a terrible generic name, I know) was something different that we experimented with but didn't use in the paper. Each line in the fragscores*.csv is a potential contact, so you can count those after scoring to know how many potential contacts the structure had.

As I mentioned in other issues, I'm working on a new paper that applied a similar approach, but with updated code for scoring (which is considerably faster). I will continue to answer questions for people using this repo, but I will focus on adding new features and options to that one.

htshtshtshts commented 1 month ago

Ah, I see. Thanks for clearing that up. Really excited for your new paper! By the way, in the 'samplePaths' step, I noticed there's an option '--countContacts'. I used it to count the potential contacts and picked out the peptide backbones with more of them for the next steps.