marrink-lab / vermouth-martinize

Describe and apply transformation on molecular structures and topologies
Apache License 2.0
84 stars 37 forks source link

Handle secondary structures for cyclic proteins #49

Open jbarnoud opened 6 years ago

jbarnoud commented 6 years ago

I will likely not implement it in #41 where I planned to because the PR is already enormous. So I write it down here.

We need to support secondary structures for cyclic proteins. Some thought were already led down in https://github.com/jbarnoud/martinize2/pull/41#issuecomment-362806623.

We also need a function to test if a protein is cyclic. This should be doable by checking if the first amino-acid residue of the residue graph is a neighbor of the last one.

Testing of this will be made easier with #48 fixed as most cyclic PDB files are NMR structures with models.

Tsjerk commented 6 years ago

PDB 1KP5 is a cyclic protein. The only solution I see at present is indeed cutting the protein in two different ways, rearranging the residues and present both to DSSP, where the secondary structure over the non-breaking break in the original is taken from the alternatively cut one.

pckroon commented 6 years ago

https://networkx.github.io/documentation/stable/reference/algorithms/cycles.html with the residue graph. To ruin your day, are cysteine-bridge-crosslinked proteins cyclic? NX will think they are. Either way this is all rather protein specific.

jbarnoud commented 6 years ago

@pckroon We could indeed use simple_cycle on the residue graph, and detect if there is a cycle such as len(list(nx.simple_cycles(G))) == len(G.nodes). If there is a cystein bridge between the two terminii, it would be detected as a cyclic protein. It is an edge case, but there are enough protein out there to stumble upon it. I do not see a way around that edge case that would not be force field specific.

Tsjerk commented 6 years ago

A cyclic protein is one in which the last residue connects (back) to the first. So it's just a matter of checking if last and first in the residue graph are connected by a backbone link. Nothing about cyclic graphs, which also will bite back if there are missing residues.

jbarnoud commented 6 years ago

Haha! Missing residues! I forgot about them! They should be taken care of prior to the secondary to most step anyway, shouldn't they?

But @Tsjerk is right about the detection. I was thinking about cases were the cut would happen at an other location, but it wouldn't as the cut is purely syntactic and not semantic at all. Cystein bridges between the terminii is still an issue, though.

Tsjerk commented 6 years ago

An SS bridge has no consequences for the secondary structure determination. The only point is that DSSP does not consider the possibility of continuity over the ends of the chain, so DSSP needs to be fooled in thinking that there is no break there, which can be achieved by moving the break somewhere else. That, or DSSP needs to be replaced by something that does a better job, but that would be more complicated and won't do for the short term.

jbarnoud commented 6 years ago

The problem is not so much what DSSP would do with that file. The problem is to know if we need to do the special handling. If we have a cystein bridge between the two terminii, then the program thinks it has to do the handling, while actually it does not.

Tsjerk commented 6 years ago

The test is: is there a backbone link between residues 1 and N? Only the -(C=O)-N(H)- peptide bond is to trigger a double DSSP pass.

jbarnoud commented 6 years ago

This is the test indeed, but I was hoping we could get rid of the force field specificity. I'll make it work for the "universal" force field first. I have an idea on how to make it more general, but I'll keep it for later.

pckroon commented 6 years ago

Either way we'll need to reconstruct world+dog+kitchensink before DSSP can interpret it, since it needs coordinates.

I don't nescessarily trust resid numbers, so I'm more in favor of a graph approach (nail, meet hammer). If we make a graph of one node per residue, termini should have degree 1. If we don't find any, it could be a cyclic protein (because terminal cysteinebridge nonsense). Then we could either look at min/max resid, or ask networkx to find the largest cycle on the atomistic level. In this cycle we can check whether all the bonds look like backbone bonds.

Anyway, this is all still very protein specific. What happens now if someone feeds us a DNA? Do we try to feed it to DSSP and watch the flames?

jbarnoud commented 6 years ago

For the moment, only proteins are passed to DSSP. There is a is_protein(molecule) selector function that returns True if all the resnames are from a predefined list. At the moment, the list is in a constant somewhere, but I plan on having it in the ForceField instead. We could also have a list of the atom names in the backbone stored in the ForceField as well.