Extend to support non-canonical amino acids

Rosetta is really hard to use with non-canonical amino acids (NCAA). Would adding non-canonical amino acid support to this project be an acceptable extension? I realize it's quite ambitious, but it might be easier to do now at this early stage of the project.

From what I understand, this ambitious goal would require:

Changing the grammar in resources/Peptides/grammars.jl and related code to support NCAA character representation if they are used
Adding backbone rotamers for the NCAA in a similar format to resources/Peptides/dunbrack_rotamers.lib
Adding YAML files for non-natural amino acids

I don't have much experience with NCAA but yes, the idea behind the usage of Grammars, is to be able to add non-canonical amino acids. We have done this in a previous iteration of the code, so I don't think it would be an extension too ambitious to pursue, at all. I believe you got all the requirements correct, those would be the necessary modifications. Do you have an example of a NCAA PDB file?

I have an update for you @Seanny123!

So, I've started to work on NCAA support for ProtoSyn. As I've state before, I didn't had to change much for this to work. I didn't know which sort of NCAA you wanted to implement, so I started with Coronamic Acid (seemed interesting):

Since I couldn't find a structural file for this NCAA (or any other, for that matter) I used molview.org to draw a starting point. It allowed me to download a cma.mol file, which I loaded in PyMOL simply to convert to a cma.pdb file with CONECT records.

I loaded this file directly into ProtoSyn (using ProtoSyn.Peptides.load("cma.pdb")) and created a quick minimization protocol (I used the TorchANI energy function component). Here's the code:

pose = ProtoSyn.Peptides.load("cma.pdb")
energy_function = ProtoSyn.Calculators.EnergyFunction([
    ProtoSyn.Calculators.TorchANI.get_default_torchani_model(),
    ProtoSyn.Calculators.Restraints.get_default_bond_distance_restraint()
])
callback = ProtoSyn.Common.default_energy_step_callback(10)
minimize = ProtoSyn.Drivers.SteepestDescent(energy_function, callback, 3000, 0.001, 0.1)
minimize(pose)

This was optional, but I wanted to make sure the structure was more or less correct.

In order to use the NCAA in a Grammar, for mutation and peptide building from sequence, there's only 1 requirement: the backbone atom names. Essentially, the reason you want support for NCAA is to include them in design efforts (i.e.: do mutations). ProtoSyn, during the mutation process, uses atom names to define 2 things:

What is the sidechain and what is the backbone
What atoms to connect (in the Peptides default grammar we define that atom C of a residue connects upstream to atom N - the peptidic bond).

As such, the only requirement for a new NCAA to be added to the Peptides default grammar is that the backbone atoms are named N, H, CA, C and O. All other atoms must be named something else (such as C1, C2, N1, etc). The rest of the "minimization" script tried to automatically infer which were the backbone atoms (the best it could ahah). I can share the rest of the script with you, if you'd like, but I feel like it is quite complicated for no reason: simply defining the correct atom names manually is much faster and accurate, in most cases.

Next, all I had to do is save the Pose in YML format (using ProtoSyn.write(pose, "cma.yml")) and that's it. I went ahead and created a new NCAA/ directory in the resources/Peptides folder to add new YML files. I also changed the grammars.yml to include an ncaa entry. This streamlines the usage of grammars without manually adding the fragments: simply toss them into the NCAA/ folder and add them to the grammars.yml variables entry.

You can find this CMA example in the 2d61aeb6e6be738ed188d1a1f880ee1e5a02f7cb commit of the dev/ branch.

In order to use the newly defined NCAA, all you need to do is:

pose = ProtoSyn.Peptides.load("2a3d.pdb")
res_lib = ProtoSyn.load_grammar_from_file(Float64, joinpath(Peptides.resource_dir, "grammars.yml"), "ncaa")
ProtoSyn.Peptides.mutate!(pose, pose.graph[1, 2], res_lib, ["CMA"])

The result: ncaa

And both the Graph and State are correct, we can rotate the structure and perform simulations as usual: ezgif com-gif-maker (Sorry for the bad quality)

I'll continue to work on NCAA support when possible. There are still some edges to perfect:

Even though the backbone dihedral angles can still be correctly identified (phi and psi), ProtoSyn has no way of knowing which are the chi angles. For the natural aminoacids, this is defined in ProtoSyn.Peptides.Dihedral.chi_dict (in the Peptides/types.jl file). It seems the easiest way to define the new chi angles is to add a way to load this information from an external source. Another (harder) option is to infer which bonds are movable automatically.
Besides defining which atoms belong to which chi angle, its also useful to use rotamer libraries. Currently, ProtoSyn uses the Backbone Dependent Dunbrack library (using ProtoSyn.Peptides.load_dunbrack()). For NCAA, most likely, these rotamers libraries are not available, so it might be useful to add methods to create our own (simply rotate all identified chi angles and measure the resulting energy according to some energy function - the TorchANI, for example). Backbone dependence, however, is mostly a statistical information - Dunbrack and his team measured chi angles in thousands of PDB entries to determine likelihood of a certain chi angle to exist, depending on the backbone phi and psi angles of that particular residue. For this reason, it seems unlikely that we'll have backbone dependent rotamer libraries for NCAA.

I hope this was helpful, and if you need any sort of NCAA sorted out, in particular, just let me know, I'd be happy to help.

Next update: identifying chi dihedrals.

You can find these changes at c8bb5957d0fa8de2aa1e2efff70a3662cb51ae88.

As I've explained in the previous post on this issue, in order to introduce more exotic aminoacids we would require a way to identify the chi dihedrals, either by loading their definition from an external source or by automatically identifying which dihedrals have rotational freedom. For this, I've selected the first approach. In either case, automatic identification can still be performed as a pre-step, writting this information to a source which is then loaded into ProtoSyn. For now, I've focused on the loading part of this problem, and have manually identified dihedral with rotational freedom.

The approach I took introduces a new entry in the grammar .yml files, defining which atoms control the chi dihedrals. I've done this to all entries, including the canonical aminoacids. As such, the _Peptides.chidict and _Peptides.availableaminoacids only get populated once a grammar is loaded. Bofore, no grammar was loaded by default, the user would have to select one and load it (for example, using Peptides.grammar()). With these changes, the default grammar (with the canonical aminoacids) gets loaded by default, and is acessible by using the Peptides.grammar variable. Any additional loading of extra grammars adds content to the _Peptides.chidict and _Peptides.availableaminoacids variables. For the case of NCAA, this can be done using Peptides.load_grammar_from_file("grammars.yml", "ncaa").

As such, chi dihedrals, in NCAA, are now accessible and rotatable in ProtoSyn, as exemplified bellow:

ezgif com-gif-maker (1)

This also means that developing the necessary rotamer libraries is now possible. I've introduced some changes in this field as well: RotamerStack instances are now called BBI_RotamerLibrary (i.e.: backbone independent rotamer library). BBD_RotamerLibrary instances (backbone dependent rotamer libraries) are, is essence, a collection of backbone independent instances, grouped by backbone phi and psi angles.

In this commit, you'll also find the resources/Peptides/calc_bbi_rot_lib.jl file, which rotates all available chi dihedrals of a NCAA and evaluates each rotamer energy according to a given energy function (in this case, I've used the TorchANI). The results are, therefore, all the local minima found in this N-dimensional matrix. The script also writes these results to a backbone independet rotamer library file (ncaa_rotamers.lib), whose format is similar to the Dunbrack rotamer library.

Note that we're still lacking a proper function to load this contents back into ProtoSyn (which I'll soon add to Peptides/Submodules/Rotamers/load.jl).

Next update: loading backbone independent rotamer libraries

You can find these changes at 1829a5dff47465002b8c3a13637ae72490e33ffc.

This is a quick one: I simply added the Peptides.load_BBI_rotamer_library in order to load the BBI_RotamerLibrary from file (using the experimental format from last update).

I think this concludes adding all minimal necessary changes to support design efforts with NCAA. Of course, I'll continue to work on this issue, adding new NCAA example (who might add extra requirements to the added changes). Let me know what you think so far! Once I have a few minutes I'll try to post a quick example on how to use the current tools! 😄

Example: Using new NCAA in mutation and rotamer exploration

The following "tasks" should now work correctly. See 33c5f25092d89f112dc03a470ce397d0a92b809c.

1. Mutating into an NCAA

julia> ncaa = ProtoSyn.Peptides.load_grammar_from_file("grammars.yml", "ncaa");

julia> pose = ProtoSyn.Peptides.load("2a3d.pdb");

julia> ProtoSyn.Peptides.mutate!(pose, pose.graph[1, 2], ncaa, ["a"]);

2. Loading and exploring a backbone independent (BBI) rotamer library

julia> rot_lib = Peptides.load_BBI_rotamer_library(Float64, "resources/Peptides/ncaa_rotamers.lib");

julia> Peptides.apply!(pose.state, rot_lib[1], pose.graph[1, 2]):

3. Using BBI rotamer libraries in RotamerMutator instances

julia> rot_mut = Peptides.Mutators.RotamerMutator(rot_lib, 1.0, 10, rn"CMA" & an"CA");

julia> rot_mut(pose)

Next update: applying post-translational modifications (PTMs)

See 56549c2967b704cd2110a42912e54404a032519e.

Even though using completly new non-canonical aminoacids is an interesting concept, more often than not, pratical applications include the introduction of post-translational modifications (PTMs). The resulting aminoacids (often considered NCAA), are derivations of existing natural aminoacids. As examples of post-translational modifications, we may consider processes of methylation and phosphorylation. In this next update, I've added the possibility to introduce these types of changes. This was easily done by adding two changes to the code:

Added the resources/Modifications/grammars.yml - This is simply a repository to save the Fragments to be added (for example, the methyl or phosphoryl groups).
Added the replace_by_fragment! function - This function picks an atom and replaces it with the selected radical group (from the new modification grammar).

The modifications grammar is automatically loaded when pre-compiling ProtoSyn and can be accessed in the variable ProtoSyn.modification_grammar.

Two examples:

julia> pose = Peptides.build(Peptides.grammar, seq"ASA");

julia> ProtoSyn.replace_by_fragment!(pose, pose.graph[1, 2, "HG"], modification_grammar.variables["PO4"]);

julia> pose = Peptides.build(Peptides.grammar, seq"ARA");

julia> ProtoSyn.replace_by_fragment!(pose, pose.graph[1, 2, "HH22"], modification_grammar.variables["MET"]);

ncaa

These new changes do not add any chi dihedral definition (I suppose the existing Dunbrack backbone dependent rotamer library becomes null, as the PTM skew the measured probabilities for each rotamer). So far, ProtoSyn only includes the chance for methylation and phosphorylation, but more can and will be added. In order to introduce new modifications, all that is required is to add the .yml file on the resources/Modifications folder, where a defined root atom (the first in the .yml file) is removed, but used to align the new group with the selected atom for replacement.

I hope these changes help. What do you think, what else can ProtoSyn do to help simulate or manipulate NCAAs?

Small update: Merging new NCAA grammar with existing natural aminoacids grammar

See d3f5ecee19f42bc7a0cebbdcd67d9c170c99e031.

In most applications (such as using the Peptides.add_hydrogens! method, for example), a Grammar is requested (in this case, to know the position of the currently non-existent hydrogens). I was having issues when trying to loop over both natural aminoacids and NCAAs: I had to run the function twice, once with the default grammar and another with the NCAA grammar.

For this reason I added the Peptides.join_grammars! and load_grammar_from_file! methods. The new usage is something like:

julia> ProtoSyn.Peptides.load_grammar_from_file!(Peptides.grammar, joinpath(Peptides.resource_dir, "grammars.yml"), "ncaa");
┌ Warning: Operator α found in existing grammar with same key. Skipping addition of operator from new grammar. Check if this is the desired behaviour.
└ @ ProtoSyn.Peptides ~/ProtoSyn.jl/src/Peptides/Submodules/Builder/grammar.jl:56
┌ Warning: Operator β found in existing grammar with same key. Skipping addition of operator from new grammar. Check if this is the desired behaviour.
└ @ ProtoSyn.Peptides ~/ProtoSyn.jl/src/Peptides/Submodules/Builder/grammar.jl:56
┌ Warning: Default operator of existing grammar was not changed. Skipping addition of default operator from new grammar. Check if this is the desired behaviour.
└ @ ProtoSyn.Peptides ~/ProtoSyn.jl/src/Peptides/Submodules/Builder/grammar.jl:62

As the warnings suggest, no overwritting of existing operators of variables occurs. Any overwritting needs to be done manually, for example:

julia> new_grammar = ProtoSyn.Peptides.load_grammar_from_file(joinpath(Peptides.resource_dir, "grammars.yml"), "ncaa");

julia> Peptides.grammar.operators["α"] = new_grammar.operators["new_α"]

This seems to be a more natural and direct application of Grammars. Now all required variables are in Peptides.grammar.

sergio-santos-group / ProtoSyn.jl

Extend to support non-canonical amino acids #31