Averaging structure scores

proteins247 commented 1 year ago

The score structures stage (stage 6) generates structure_scores files and all_fragment_scores/frag_scores files. According to the README and some of the closed GitHub issues, the recommended way to score each backbone is to average the per-residue scores in the structure_scores file.

What should I do when some of the residue scores in structure_scores are 1.79769e+308, i.e. effectively infinity? This seems to happen when the number of results (the num_results column in all_fragment_scores) for a particular peptide residue/protein residue pair is 0.

One choice is to ignore these infinite-valued scores. Another is to take them as a signal that a peptide residue is not a good match to the start protein and filter out such peptides. I'm not sure.

Another thing: I noticed that the score in the structure_scores file for each peptide residue is the sum of that residue's scores in all_fragment_scores. A peptide residue can have zero, one, or several entries in the all_fragment_scores file, depending on how many protein residues it is scored against. So there are cases where the structure score for a peptide residue is "infinity" even though it is a sum of both finite and infinite fragment scores.

For instance, here is a sample from an all_fragment_scores file:

     seed                                   seed_res_num  target_res_num  num_results  pair_score     bg_score       seq_score      contact_score  total_score
218  ../4_samplePaths/path_structures/S...  B13           A568            5            2.500000e-02   3.688320e-02   3.888800e-01   0.0            0.38888
222  ../4_samplePaths/path_structures/S...  B13           A569            0            1.797690e+308  1.797690e+308  1.797690e+308  NaN            NaN

Residue 13 is scored twice, against protein residues 568 and 569, and the score is infinity in one case. The resulting value in the structure_scores file is infinity. Should I just drop the infinite scores from the all_fragment_scores file and then recalculate the per-residue structure scores?
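To make the "drop and recalculate" option concrete, here is a minimal sketch that sums only the defined fragment scores per peptide residue and counts how many contacts were dropped. It uses the column names from the sample above; the function names, the choice of `total_score` as the column to sum, and the 1e300 threshold for "effectively infinity" are assumptions for illustration, not something the pipeline itself defines.

```python
import math
from collections import defaultdict

# Assumption: any value this large is the "no matches" sentinel
# (the files show 1.79769e+308, close to the largest finite double).
BIG = 1e300

def is_undefined(score):
    """True for NaN, +/-inf, or a sentinel-sized value."""
    return math.isnan(score) or abs(score) >= BIG

def per_residue_scores(rows, column="total_score"):
    """Sum the defined scores per peptide residue.

    rows: iterable of dicts keyed by the all_fragment_scores column names.
    Returns {seed_res_num: (summed_score, n_dropped_contacts)}.
    """
    sums = defaultdict(float)
    dropped = defaultdict(int)
    for row in rows:
        res = row["seed_res_num"]
        if row["num_results"] == 0 or is_undefined(row[column]):
            dropped[res] += 1
            sums[res] += 0.0  # keep the residue in the output even if all rows drop
        else:
            sums[res] += row[column]
    return {res: (sums[res], dropped[res]) for res in sums}

# The two sample rows for peptide residue B13.
rows = [
    {"seed_res_num": "B13", "target_res_num": "A568",
     "num_results": 5, "total_score": 0.38888},
    {"seed_res_num": "B13", "target_res_num": "A569",
     "num_results": 0, "total_score": float("nan")},
]
result = per_residue_scores(rows)
print(result)
```

Keeping the dropped-contact count next to the sum means the information in the infinite scores is not lost: a later filter can still reject residues or backbones with too many undefined contacts.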

Finally, how important is scoring? I filtered on the average number of potential contacts per residue, and I'm down to 400 or so candidate backbones. Could I just do an RMSD clustering, take a handful of candidate backbones, and continue with the rest of the design process? I believe I'm designing against a hard target, which is why I get structure scores like this.

swanss commented 1 year ago

The idea behind scoring is to identify peptide backbones that are compatible with the fixed sequence and structure of the protein target. In this project, we do that by breaking the interface up into peptide-protein fragments and searching those for matches in the PDB. An interface contact will have a good score if the probability of finding the amino acid on the target side is higher than the background (which is just the target fragment alone).
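As a rough check of this idea against the sample row quoted earlier in the thread, the seq_score there is numerically consistent with the negative natural log of pair_score divided by bg_score. That relation is inferred from a single row and is not stated anywhere in this thread, so treat the function below (a hypothetical name) as an assumption rather than the project's actual formula.

```python
import math

def log_ratio_score(pair_score, bg_score):
    """Negative log-ratio of the pair probability to the background.

    Assumption: inferred from one sample row, not taken from the project's code.
    The result is negative when pair_score exceeds bg_score.
    """
    return -math.log(pair_score / bg_score)

# Values from row 218 of the sample (pair_score, bg_score).
score = log_ratio_score(2.5e-02, 3.688320e-02)
print(round(score, 5))  # prints 0.38888, matching seq_score in that row
```

Under this reading, lower scores are better: a contact whose amino acid is more likely than the background gets a negative score, which fits sorting backbones by lowest average score.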

Of course, we can only gather these statistics if there are sufficient matches in the database, so we set the value to +INF to indicate that there were not enough matches to compute a score. As you mention, this is evidence that the interface does not form a common structural motif (i.e. it is not designable), and we chose to discard such backbones from the candidate set prior to sequence design. If I were you, I would filter out all designs that have a non-designable contact. However, if this is too strict and leaves you with no candidates, you can define a looser rule: drop a candidate if it has more than C non-designable contacts; if it passes that cutoff, drop the non-designable contacts and average over the remaining ones.
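The cutoff rule described above could be sketched as follows. The function name, the `max_undefined` parameter (playing the role of C), and the 1e300 threshold for detecting the +INF sentinel are hypothetical choices for illustration; setting `max_undefined=0` gives the strict filter.

```python
import math

def backbone_score(contact_scores, max_undefined=0, big=1e300):
    """Average the designable contact scores of one backbone.

    Returns None if the backbone has more than max_undefined
    non-designable contacts (or no designable contacts at all).
    Assumption: NaN, inf, and values >= big all mean "non-designable".
    """
    defined = [s for s in contact_scores
               if not (math.isnan(s) or abs(s) >= big)]
    n_undefined = len(contact_scores) - len(defined)
    if n_undefined > max_undefined or not defined:
        return None
    return sum(defined) / len(defined)

# Illustrative scores: two designable contacts and one sentinel value.
scores = [0.4, 1.79769e308, 0.2]
strict = backbone_score(scores)                    # rejected under C = 0
lenient = backbone_score(scores, max_undefined=1)  # averaged over 2 contacts
print(strict, lenient)
```

Returning None for rejected backbones, rather than a large number, keeps them from being sorted alongside real averages by accident.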

I do believe the scoring is important, as it's the first step where the sequence of the protein target is really taken into account. That being said, protein design/prediction methods have evolved very rapidly over the past year, and it might be faster to just design sequences with an ML method (e.g. ProteinMPNN) and then evaluate them with the Rosetta interface analyzer or AlphaFold.

proteins247 commented 1 year ago

Thanks for the explanation and quick response. I have to think about the next step, then, especially with the mention of proteinMPNN.

Can you clarify the situation where, in all_fragment_scores, a particular peptide residue has an infinite score with respect to one target protein residue but a non-infinite score with respect to another? Does that peptide residue still merit an infinite score?

If I average scores and sort by score, I get 63 non-infinite scores. The lowest scores are 0.36, 0.36, 0.52... One thing that surprises me is that the top few peptide backbones are all alpha helices positioned in roughly the same orientation in the same region of the target protein. Some of the peptides that had the highest number of potential contacts per residue did not score well because of infinite scores for some of their residues.

Looking at the top 20 scoring peptides, 18 are pure alpha helices, while 2 show a helix-coil conformation. Per your advice, I can definitely try to do an RMSD similarity exclusion and design on some of these best-scoring peptides.
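One simple way to do an RMSD similarity exclusion is a greedy pass over the candidates in score order, keeping a backbone only if it is farther than some cutoff from everything already kept. The sketch below assumes a pairwise RMSD matrix has already been computed by some other tool; the function name, the 2.0 cutoff, the fourth score, and all RMSD values are made up for illustration.

```python
def select_diverse(scores, rmsd, cutoff):
    """Greedy exclusion: visit candidates from lowest (best) score up,
    and keep one only if its RMSD to every kept candidate exceeds cutoff.

    scores: list of average scores, one per backbone.
    rmsd:   symmetric matrix (list of lists) of pairwise RMSDs.
    Returns the indices of the kept backbones, best score first.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    kept = []
    for i in order:
        if all(rmsd[i][j] > cutoff for j in kept):
            kept.append(i)
    return kept

# Illustrative data: three similar helices and one distinct backbone.
scores = [0.36, 0.36, 0.52, 0.60]
rmsd = [
    [0.0, 0.8, 1.1, 6.0],
    [0.8, 0.0, 0.9, 5.5],
    [1.1, 0.9, 0.0, 5.8],
    [6.0, 5.5, 5.8, 0.0],
]
kept = select_diverse(scores, rmsd, cutoff=2.0)
print(kept)  # prints [0, 3]
```

With top-scoring backbones that cluster in one region of the target, as described above, this keeps one representative per cluster instead of several near-duplicates.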

swanss / peptide_design

Averaging structure scores #14