Closed tabinks closed 9 years ago
:open_mouth: I'm pretty sure that I've violated the TOS by looking at this while on the Argonne network.
[[-0.37957218 -0.9246734 -0.03006768 0. ]
[-0.71509376 0.27261054 0.64368036 0. ]
[-0.58699735 0.26582437 -0.76470355 0. ]
[ 0. 0. 0. 1. ]]
This is the superposition after applying the rotation matrix:
Worlds quickest PyMOL introduction:
Useful PyMol Commands:
load 1mbn.pdb
loads a fileshow cartoon
hide sticks; hide lines
ray
ray traces a structure to make it look prettysave file.png
saves what you see on the screen as a png fileSome questions:
1) How do we apply the rotation matrix to pdb files to get the superposition? 2) If the rotation matrix for the alignment is not the same as the rotation matrix above, is the alignment necessarily incorrect? For example, my rotation matrix after downloading the relevant pdbs from RCSB:
[[-0.33829432 -0.93634242 -0.09391389 0. ] [-0.66054726 0.16519408 0.7323853 0. ] [-0.67024941 0.30979635 -0.67438264 0. ] [ 0. 0. 0. 1. ]]
Also, if I change the order in which the pdbs are passed to my function:
[[ 1.00000000e+00 5.55111512e-17 0.00000000e+00 0.00000000e+00] [ 2.77555756e-16 1.00000000e+00 -1.11022302e-16 0.00000000e+00] [ -2.01227923e-16 -6.93889390e-17 1.00000000e+00 0.00000000e+00] [ 0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00]]
Also my RMSD is 8.4325840831? Not sure what's up.
EDIT: Okay, I think the RMSD's are the same and the reason I got ~8.4 was because I chomped the longer sequence instead of aligning contiguous subsequences with length of the shorter subsequence to the shorter sequence. And also, I think the RMSD can be the same but the rotational matrix might differ?
3) Is it correct that, when doing a local alignment, we consider all 10 consecutive residue subsets of one protein sequence against the entire reference sequence?
For (1), the Soup
class has a transform
method. I think you just pass the matrix to that, and then call write_pdb
to save it to a new file.
Question: should structural alignments preserve the order of residues in each protein? That limits the combinatorial explosion, but not nearly enough to make searching all possibilities tractable.
For a global alignment, there are nCm
possible subsequences of the longer sequence to align to the shorter one, where n
is the length of the longer sequence, and m
is the length of the shorter sequence (and that's if we require preserving the order). For q1, that's 153C145 = 6183023199255
. Clearly we can't look at all possibilities, and need a heuristic.
Brainstorming ideas:
That sounds right, but the transformed PDB seems to only have one protein in it, but maybe I'm visualizing it incorrectly?
The contiguous sequences seem like a no-go if we can't get toward that ~3.5 in Prof. Binkowski's output...
Maybe we could take the subsequences with the best ratio of amino acids measured against the ratio in the longer subsequence? However, the amino acid ratios and maybe even S-W don't necessarily suggest anything about the RMSD of the alignment. Seems like we should select the subsequences with conserved zippers/helices/sheets, because that hints at homology--but then, what's the complexity of that and how do we do it?
Feeling lost right now :grimacing:
Also, if I read this correctly, our issues are a part of a much-larger unsolved problem in bioinformatics: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3591119/
Ya I think the idea is to try different things and see what gives good results. I'm also trying random subsequences.
I wonder if I'm doing something wrong with the RMSD, the best score I've seen so far is 23.8.
@tabinks, can you define 'global' structural alignment? I don't understand how you and @manglano are getting such low scores, my lowest ~23.8. Maybe I am interpreting global differently/wrong?
For sequences in SW, global relates to how the score is optimized, and basically amounts to aligning the entire sequences, but because of gaps the number of actually aligned residues can vary between min(size of sequences) and 0 (all gaps). I notice that if I do S-W alignment and then calculate the RMSD, the number of residues vary, even among all optimal alignments. Interestingly the RMSD still very close for all of them though (the mean part of RMSD takes care of that I guess).
The most literal interpretation of 'global' structural alignment would be to align as many residues as possible, i.e. the size should be equal to the size of the smallest protein chain. Then it just amounts to picking which residues in the longer sequence to align to the shorter sequence. This was my initial approach, but again I'm getting much higher rmsd values.
Here's an idea: instead of using the scoring matrix to score S-W, what about using the RMSD to score it? Would have to swap the signed around since low RMSD is better, but that seems like a reasonable way to attack the problem.
EDIT: I don't think this works. RMSD has to be based on the entire solution, because each optimal local RMSD is likely to have a different rotation. Maybe there is still some way to 'score' residue alignments based on something from the pdb geometry, but it's not obvious to me how.
@manglano
transform
function: soup1.transform(rotation_matrix)
@bdallen-uchicago
Question: should structural alignments preserve the order of residues in each protein? That limits the combinatorial explosion, but not nearly enough to make searching all possibilities tractable.
For the global and local, yes preserve the order of the residues in the protein. For the surface alignment, you should not preserve the ordering. You should be reordering them.
For a global alignment, there are nCm possible subsequences of the longer sequence to align to the shorter one, where n is the length of the longer sequence, and m is the length of the shorter sequence (and that's if we require preserving the order). For q1, that's 153C145 = 6183023199255. Clearly we can't look at all possibilities, and need a heuristic.
Yes, you will need to develop heuristics to make the alignment possible.
@tabinks, can you define 'global' structural alignment? I don't understand how you and @manglano are getting such low scores, my lowest ~23.8. Maybe I am interpreting global differently/wrong?
Global means that you are using the entire length of at least one of the sequences. Take the shortest sequence and slide it along the longer sequence. Check the RMSD for each pair until you have covered the entire sequence. Use the smallest RMSD. It sounds like you understood this, but I'm not sure why you got such high RMSD.
For the global alignment - make sure to use numpy_svd_rmsd_rot to calculate the RMSD, not the sum_rmsd. With the numpy_svd_rmsd method I get a rmsd of 3.4096 and the same rotation matrix as @tabinks (I think you might have missed a 0 in your rmsd since you reported it as 3.496). Using sum_rmsd (which is used by default as long as transform_pdb1 is not None), I get the 23.8 rmsd that others have reported.
I'm still having issues with the visualization, though. I transformed the 1mbn soup with the rotation matrix and then centered it around 3qm9's center, but my picture doesn't look like Andrew's picture. If I don't center the soup around 3qm9's center, it's even worse.
I've been using rmsd.calc_rmsd_rot
, and I get the same result (~23.8) calling the numpy one directly. Can someone confirm that the length of the superposable atoms is 153 for 1mbn, and 145 for 3qm9? And that you get ~3.4 score just from sliding the shorter 3qm9 sequence along the longer one, no gaps?
EDIT: I think the problem is that I'm not centering, looking at the source of rmsd_of_soups
.
I can confirm that if I use rmsd_of_soups
, which does the centering internally, that I end up with 3.4097
at offset 5
. RIP the several hours I wasted trying to come up with clever alternate selection strategies :'(.
I wonder if the trouble people are having generating the visualization is also related to centering?
Things to keep in mind for the visualization:
pdbremix.v3.left_inverse
.soup1.transform(v3.translation(center2))
. Note that this is the opposite of what is used to center soup2.Do we need to center before we do the RMSD? How do we handle this for multiple sequences--do we only center one before looping through RMSD calculations, recenter each time, or (heaven forbid) make a new set of soups for every subsequence?
It seems rmsd_of_soups allows the user to specify an index fragment which could significantly shorten the code length but as it stands my global alignment function is like ~40 lines long and I have no idea how to shorten it. @bdallen-uchicago, how did you use rmsd_of_soups to do a slide? If I crack the slide problem I think I'll be in good shape to finish this homework by tonight.
I refactored so I do the centering myself and didn't need to use rmsd_of_soups
. I think the way it globally modifies the input soup is terrible design, and makes sorting out how to transform the soup to get the correct visualization confusing. It should at the very least have a loud warning in the docstring that it modifies it's input params. It's pretty simple to do the centering yourself without modifying the soup, e.g.:
def center_vlist(vlist):
"""
Center a list of v3 vectors and return (center, centered_vlist_copy)
"""
center = v3.get_center(vlist)
center_matrix = v3.translation(-center)
return (center, [v3.transform(center_matrix, v) for v in vlist])
If you keep track of the center used on each alignment, and save it each time you find a new 'best' alignment, you can use that to translate the soup (or better yet a copy of the soup) at the end to generate the transformed pdb and visualization.
The code quality in pdbremix is mixed, it definitely suffers from being written mostly by single author. If you really want to use rmsd_of_soups, you need to pass it residue tags in the segment lists, which is awkward, but you can get the tag from an atom: atom.res_tag()
and use that to build the segments, e.g. segment1 = [(atoms[start1].res_tag(), atoms[end1].res_tag())]
.
@manglano to answer your question more directly, I'm pretty sure that you always need to center both sets of coordinates before doing the RMSD/rot calculation. The centering needs to be based on the set of coordinates you are passing, not on the center of the entire soup.
@bdallen-uchicago Aha! Okay, I think I see. The RMSD calculation occurs independently of the geometric stuff, but we should calculate & store the output of the geometric functions for each set of sequences to get the correct visualization for the best RMSD. Thanks!
@tabinks: Is the subset of a coordinate set functionally equivalent to the subset of a protein sequence?
Ill post how to apply rotation later tonight.
This is what I did, it may be different from what you did, but that doesn't mean yours is wrong. Without seeing what you did, I'm guessing you only transformed your coordinate array to the center of mass and forgot to also apply it to the soup.
# Identify the center of mass of the structure
center1 = v3.get_center(crds1)
center2 = v3.get_center(crds2)
# Transform the structure to its own center of mass
soup1.transform(v3.translation(-center1))
soup2.transform(v3.translation(-center2))
# Calculate RMSD
calculated_rmsd, transform_1_to_2 = rmsd.calc_rmsd_rot(crds1, crds2)
soup1.transform(transform_1_to_2)
# Write out a PDB file of the mobile structure
soup1.write_pdb(transformed_pdb1)
@bdallen-uchicago @tabinks Okay, so my global alignment tries all the subsequences with equal length, centers the coordinate sets, and then aligns the coordinates with 'rmsd.calc_rmsd_rot(crds1copy, crds2copy)', and then dumps all the information into a tuple to find the minimum RMSD and do a translation on the appropriate pdb for the centering coordinates and the rotation matrix.
Buuuut now my minimum RMSD is ~23.8. And, the center of mass for a subset of the coordinates isn't necessarily the center of mass for the whole protein. I'm not sure how I got here, nor of what I need to fix. I'm going to keep implementing so I have some functional solution.
@manglano 23-24 is the RMSD I was getting when the coords I passed to rmsd.calc_rmsd_rot
weren't centered, but there could be other causes. Notice that the code posted by @tabinks which is similar to rmsd_of_soups
also calculates the center of just the coords in the alignment.
This is how I do it, using the center_vlist
I posted above:
# for each pair of contiguous subsets of coords (crds1, crds2)
center1, centered_crds1 = center_vlist(crds1)
center2, centered_crds2 = center_vlist(crds2)
score, rot_matrix = rmsd.calc_rmsd_rot(centered_crds1, centered_crds2)
# compare with current best score, if this is better,
# save score, rot_matrix, center1, center2, offsets, etc
# as the new best (I defined a class for storing these)
This was simpler for me because it made it clearer what was going on with the centering and transformations, and allows me to try lots of different methods for selecting the coords before messing with the soup, but there are other ways as @tabinks said.
I also renamed crds
to coords
because I keep typing it as cdrs
, drives me insane. Tried to use crds here to be consistent, and realized I did it again and had to edit it :P.
Here's the code I use to transform soup1:
soup1x = self.soup1.copy()
soup1x.transform(v3.translation(-self.best.center1))
soup1x.transform(self.best.rot_matrix)
soup1x.transform(v3.translation(self.best.center2))
self.best
is the object where I save all the info related to the best alignment. I have a similar method for transforming soup2 that inverts the rot matrix (EDIT: I did this just for fun/make sure I understand, transforming both of them in this way would mess things up, since I'm already translating soup1 to the center of soup2 here).
Got it. I think I wasn't centering the coordinate matrix before running the RMSD; this passes the test case now. Thank you very much for your help. I think I can map this onto the local alignments and the rest of the homework without too much difficulty but I'll have to puzzle out the visualizations.
Okay. My local alignment works, my global alignment works. I get boundary indices for the local and global alignment residues. The center for each subset of coordinates gets saved to a tuple along with the RMSD so I get the center of mass for each subset. I apply these centers of mass to the soup at the end, saying: 1) move the soup to the center of mass for the subset of its own residues giving the best RMSD 2) transform the soup by the rotation matrix 3) move the soup to the center of mass of the other soup 4) output file.
Then, I get something rotated correctly and very nearly aligned, as though a translation in a line would get all the helices overlapping:
So, I think all the math is correct but something about the transforms isn't. How do I get an awesome superposition like Andrew?
(I'll probably need to submit another set of revisions for this assignment (;___;):ok_hand: )
I'll spend some time reviewing structure alignment tonight at the beginning of class. Everyone will have a chance to resubmit their work.
I hope everyone is having some fun playing around with protein structures. I went through and redid the homework myself and here are a couple of things to keep in mind:
pdbremix
to compute the RMSD was 13 lines long. If yours is much longer than this, you should take another look at the API. You're probably trying too hard.ATOM
.HETATM
refers to heatero atoms which are typically ligands that have been co-crystallized with the protein structure (in this case heme).REMARK
are text notes.CONECT
describe which atoms are connected to each other. This is used by visualization programs.