Open jbarnoud opened 6 years ago
PDB 1KP5 is a cyclic protein. The only solution I see at present is indeed cutting the protein in two different ways, rearranging the residues and present both to DSSP, where the secondary structure over the non-breaking break in the original is taken from the alternatively cut one.
https://networkx.github.io/documentation/stable/reference/algorithms/cycles.html with the residue graph. To ruin your day, are cysteine-bridge-crosslinked proteins cyclic? NX will think they are. Either way this is all rather protein specific.
@pckroon We could indeed use simple_cycle
on the residue graph, and detect if there is a cycle such as len(list(nx.simple_cycles(G))) == len(G.nodes)
. If there is a cystein bridge between the two terminii, it would be detected as a cyclic protein. It is an edge case, but there are enough protein out there to stumble upon it. I do not see a way around that edge case that would not be force field specific.
A cyclic protein is one in which the last residue connects (back) to the first. So it's just a matter of checking if last and first in the residue graph are connected by a backbone link. Nothing about cyclic graphs, which also will bite back if there are missing residues.
Haha! Missing residues! I forgot about them! They should be taken care of prior to the secondary to most step anyway, shouldn't they?
But @Tsjerk is right about the detection. I was thinking about cases were the cut would happen at an other location, but it wouldn't as the cut is purely syntactic and not semantic at all. Cystein bridges between the terminii is still an issue, though.
An SS bridge has no consequences for the secondary structure determination. The only point is that DSSP does not consider the possibility of continuity over the ends of the chain, so DSSP needs to be fooled in thinking that there is no break there, which can be achieved by moving the break somewhere else. That, or DSSP needs to be replaced by something that does a better job, but that would be more complicated and won't do for the short term.
The problem is not so much what DSSP would do with that file. The problem is to know if we need to do the special handling. If we have a cystein bridge between the two terminii, then the program thinks it has to do the handling, while actually it does not.
The test is: is there a backbone link between residues 1 and N? Only the -(C=O)-N(H)- peptide bond is to trigger a double DSSP pass.
This is the test indeed, but I was hoping we could get rid of the force field specificity. I'll make it work for the "universal" force field first. I have an idea on how to make it more general, but I'll keep it for later.
Either way we'll need to reconstruct world+dog+kitchensink before DSSP can interpret it, since it needs coordinates.
I don't nescessarily trust resid numbers, so I'm more in favor of a graph approach (nail, meet hammer). If we make a graph of one node per residue, termini should have degree 1. If we don't find any, it could be a cyclic protein (because terminal cysteinebridge nonsense). Then we could either look at min/max resid, or ask networkx to find the largest cycle on the atomistic level. In this cycle we can check whether all the bonds look like backbone bonds.
Anyway, this is all still very protein specific. What happens now if someone feeds us a DNA? Do we try to feed it to DSSP and watch the flames?
For the moment, only proteins are passed to DSSP. There is a is_protein(molecule)
selector function that returns True if all the resnames are from a predefined list. At the moment, the list is in a constant somewhere, but I plan on having it in the ForceField instead. We could also have a list of the atom names in the backbone stored in the ForceField as well.
I will likely not implement it in #41 where I planned to because the PR is already enormous. So I write it down here.
We need to support secondary structures for cyclic proteins. Some thought were already led down in https://github.com/jbarnoud/martinize2/pull/41#issuecomment-362806623.
We also need a function to test if a protein is cyclic. This should be doable by checking if the first amino-acid residue of the residue graph is a neighbor of the last one.
Testing of this will be made easier with #48 fixed as most cyclic PDB files are NMR structures with models.