You Can't Unsee These Homework Tips

tabinks commented 9 years ago

I hope everyone is having some fun playing around with protein structures. I went through and redid the homework myself and here are a couple of things to keep in mind:

You can only compare structures of the same size (i.e. the matrices must be the same size). For global alignment, the structures are not the same size, so you will need to determine a strategy to find the best global alignment from all possible global alignments.
The main alignment code using pdbremixto compute the RMSD was 13 lines long. If yours is much longer than this, you should take another look at the API. You're probably trying too hard.
Remember that we only will be using the CA (alpha carbon) atoms to represent our structure. You should be using something like this to select the atoms and convert them into a matrix.

soup1 = pdbatoms.Soup(pdb1)
atoms1 = rmsd.get_superposable_atoms(soup1, None, ['CA'])
crds1 = [a.pos for a in atoms1]

We are only interested in using lines in the .pdb file that begin with ATOM.
- HETATM refers to heatero atoms which are typically ligands that have been co-crystallized with the protein structure (in this case heme).
- REMARK are text notes.
- CONECT describe which atoms are connected to each other. This is used by visualization programs.

bdallen-uchicago commented 9 years ago

:open_mouth: I'm pretty sure that I've violated the TOS by looking at this while on the Argonne network.

tabinks commented 9 years ago

Question1 Answers

For question 1, the best RMSD I found was 3.496. The rotation matrix was:

[[-0.37957218 -0.9246734  -0.03006768  0.        ]
 [-0.71509376  0.27261054  0.64368036  0.        ]
 [-0.58699735  0.26582437 -0.76470355  0.        ]
 [ 0.          0.          0.          1.        ]]

This is the superposition after applying the rotation matrix:

screenshot 2015-11-04 15 09 55

tabinks commented 9 years ago

Worlds quickest PyMOL introduction:

https://drive.google.com/open?id=0B3XzcKIiWyccMF9RVHprWE1GbVU

Useful PyMol Commands:

load 1mbn.pdb loads a file
show cartoon
hide sticks; hide lines
ray ray traces a structure to make it look pretty
save file.png saves what you see on the screen as a png file

ghost commented 9 years ago

Some questions:

1) How do we apply the rotation matrix to pdb files to get the superposition? 2) If the rotation matrix for the alignment is not the same as the rotation matrix above, is the alignment necessarily incorrect? For example, my rotation matrix after downloading the relevant pdbs from RCSB:

[[-0.33829432 -0.93634242 -0.09391389 0. ] [-0.66054726 0.16519408 0.7323853 0. ] [-0.67024941 0.30979635 -0.67438264 0. ] [ 0. 0. 0. 1. ]]

Also, if I change the order in which the pdbs are passed to my function:

[[ 1.00000000e+00 5.55111512e-17 0.00000000e+00 0.00000000e+00] [ 2.77555756e-16 1.00000000e+00 -1.11022302e-16 0.00000000e+00] [ -2.01227923e-16 -6.93889390e-17 1.00000000e+00 0.00000000e+00] [ 0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00]]

Also my RMSD is 8.4325840831? Not sure what's up.

EDIT: Okay, I think the RMSD's are the same and the reason I got ~8.4 was because I chomped the longer sequence instead of aligning contiguous subsequences with length of the shorter subsequence to the shorter sequence. And also, I think the RMSD can be the same but the rotational matrix might differ?

3) Is it correct that, when doing a local alignment, we consider all 10 consecutive residue subsets of one protein sequence against the entire reference sequence?

bdallen-uchicago commented 9 years ago

For (1), the Soup class has a transform method. I think you just pass the matrix to that, and then call write_pdb to save it to a new file.

bdallen-uchicago commented 9 years ago

Question: should structural alignments preserve the order of residues in each protein? That limits the combinatorial explosion, but not nearly enough to make searching all possibilities tractable.

For a global alignment, there are nCm possible subsequences of the longer sequence to align to the shorter one, where n is the length of the longer sequence, and m is the length of the shorter sequence (and that's if we require preserving the order). For q1, that's 153C145 = 6183023199255. Clearly we can't look at all possibilities, and need a heuristic.

Brainstorming ideas:

only look at the contiguous sequences (there are 9 for q1). This is super simple, but does not yield great results; the best score I get is 23.9.
extract the sequences, and do an S-W global alignment. Convert back to the residue numbers, and use that as the start position for a local search, e.g. look for single residue changes that increase the score.
Greedily build up a solution geometrically

ghost commented 9 years ago

That sounds right, but the transformed PDB seems to only have one protein in it, but maybe I'm visualizing it incorrectly?

The contiguous sequences seem like a no-go if we can't get toward that ~3.5 in Prof. Binkowski's output...

Maybe we could take the subsequences with the best ratio of amino acids measured against the ratio in the longer subsequence? However, the amino acid ratios and maybe even S-W don't necessarily suggest anything about the RMSD of the alignment. Seems like we should select the subsequences with conserved zippers/helices/sheets, because that hints at homology--but then, what's the complexity of that and how do we do it?

Feeling lost right now :grimacing:

Also, if I read this correctly, our issues are a part of a much-larger unsolved problem in bioinformatics: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3591119/

bdallen-uchicago commented 9 years ago

Ya I think the idea is to try different things and see what gives good results. I'm also trying random subsequences.

I wonder if I'm doing something wrong with the RMSD, the best score I've seen so far is 23.8.

bdallen-uchicago commented 9 years ago

@tabinks, can you define 'global' structural alignment? I don't understand how you and @manglano are getting such low scores, my lowest ~23.8. Maybe I am interpreting global differently/wrong?

For sequences in SW, global relates to how the score is optimized, and basically amounts to aligning the entire sequences, but because of gaps the number of actually aligned residues can vary between min(size of sequences) and 0 (all gaps). I notice that if I do S-W alignment and then calculate the RMSD, the number of residues vary, even among all optimal alignments. Interestingly the RMSD still very close for all of them though (the mean part of RMSD takes care of that I guess).

The most literal interpretation of 'global' structural alignment would be to align as many residues as possible, i.e. the size should be equal to the size of the smallest protein chain. Then it just amounts to picking which residues in the longer sequence to align to the shorter sequence. This was my initial approach, but again I'm getting much higher rmsd values.

bdallen-uchicago commented 9 years ago

Here's an idea: instead of using the scoring matrix to score S-W, what about using the RMSD to score it? Would have to swap the signed around since low RMSD is better, but that seems like a reasonable way to attack the problem.

EDIT: I don't think this works. RMSD has to be based on the entire solution, because each optimal local RMSD is likely to have a different rotation. Maybe there is still some way to 'score' residue alignments based on something from the pdb geometry, but it's not obvious to me how.

tabinks commented 9 years ago

@manglano

To apply a transform use the transform function: soup1.transform(rotation_matrix)
You should calculate all 10 residues stretcher from the reference against all stretch of 10 residues in the mobile.
You are correct, you are only transforming one of the PDBs. You will need to visualize them both at the same time to visualize the alignment.

tabinks commented 9 years ago

@bdallen-uchicago

Question: should structural alignments preserve the order of residues in each protein? That limits the combinatorial explosion, but not nearly enough to make searching all possibilities tractable.

For the global and local, yes preserve the order of the residues in the protein. For the surface alignment, you should not preserve the ordering. You should be reordering them.

For a global alignment, there are nCm possible subsequences of the longer sequence to align to the shorter one, where n is the length of the longer sequence, and m is the length of the shorter sequence (and that's if we require preserving the order). For q1, that's 153C145 = 6183023199255. Clearly we can't look at all possibilities, and need a heuristic.

Yes, you will need to develop heuristics to make the alignment possible.

tabinks commented 9 years ago

@tabinks, can you define 'global' structural alignment? I don't understand how you and @manglano are getting such low scores, my lowest ~23.8. Maybe I am interpreting global differently/wrong?

Global means that you are using the entire length of at least one of the sequences. Take the shortest sequence and slide it along the longer sequence. Check the RMSD for each pair until you have covered the entire sequence. Use the smallest RMSD. It sounds like you understood this, but I'm not sure why you got such high RMSD.

sdkramer10 commented 9 years ago

For the global alignment - make sure to use numpy_svd_rmsd_rot to calculate the RMSD, not the sum_rmsd. With the numpy_svd_rmsd method I get a rmsd of 3.4096 and the same rotation matrix as @tabinks (I think you might have missed a 0 in your rmsd since you reported it as 3.496). Using sum_rmsd (which is used by default as long as transform_pdb1 is not None), I get the 23.8 rmsd that others have reported.

I'm still having issues with the visualization, though. I transformed the 1mbn soup with the rotation matrix and then centered it around 3qm9's center, but my picture doesn't look like Andrew's picture. If I don't center the soup around 3qm9's center, it's even worse.

bdallen-uchicago commented 9 years ago

I've been using rmsd.calc_rmsd_rot, and I get the same result (~23.8) calling the numpy one directly. Can someone confirm that the length of the superposable atoms is 153 for 1mbn, and 145 for 3qm9? And that you get ~3.4 score just from sliding the shorter 3qm9 sequence along the longer one, no gaps?

EDIT: I think the problem is that I'm not centering, looking at the source of rmsd_of_soups.

bdallen-uchicago commented 9 years ago

I can confirm that if I use rmsd_of_soups, which does the centering internally, that I end up with 3.4097 at offset 5. RIP the several hours I wasted trying to come up with clever alternate selection strategies :'(.

I wonder if the trouble people are having generating the visualization is also related to centering?

bdallen-uchicago commented 9 years ago

Things to keep in mind for the visualization:

The rotation matrix returned by calc_rmsd_rot rotates the first coords into the second coords. This works great if you are writing a new version of the first pdb, but if you want to reverse it and transform the second pdb, you can use pdbremix.v3.left_inverse.
In either case, make sure to center the soup before applying the rotation! EDIT: and center it using the center of the aligned subset of coords, which for the longer sequence won't be all coords.
The final step is to translate the centered, rotated soup to the center of the other pdb. For example, if transforming soup1, you would use soup1.transform(v3.translation(center2)). Note that this is the opposite of what is used to center soup2.

ghost commented 9 years ago

Do we need to center before we do the RMSD? How do we handle this for multiple sequences--do we only center one before looping through RMSD calculations, recenter each time, or (heaven forbid) make a new set of soups for every subsequence?

It seems rmsd_of_soups allows the user to specify an index fragment which could significantly shorten the code length but as it stands my global alignment function is like ~40 lines long and I have no idea how to shorten it. @bdallen-uchicago, how did you use rmsd_of_soups to do a slide? If I crack the slide problem I think I'll be in good shape to finish this homework by tonight.

bdallen-uchicago commented 9 years ago

I refactored so I do the centering myself and didn't need to use rmsd_of_soups. I think the way it globally modifies the input soup is terrible design, and makes sorting out how to transform the soup to get the correct visualization confusing. It should at the very least have a loud warning in the docstring that it modifies it's input params. It's pretty simple to do the centering yourself without modifying the soup, e.g.:

def center_vlist(vlist):
    """
    Center a list of v3 vectors and return (center, centered_vlist_copy)
    """
    center = v3.get_center(vlist)
    center_matrix = v3.translation(-center)
    return (center, [v3.transform(center_matrix, v) for v in vlist])

If you keep track of the center used on each alignment, and save it each time you find a new 'best' alignment, you can use that to translate the soup (or better yet a copy of the soup) at the end to generate the transformed pdb and visualization.

The code quality in pdbremix is mixed, it definitely suffers from being written mostly by single author. If you really want to use rmsd_of_soups, you need to pass it residue tags in the segment lists, which is awkward, but you can get the tag from an atom: atom.res_tag() and use that to build the segments, e.g. segment1 = [(atoms[start1].res_tag(), atoms[end1].res_tag())].

bdallen-uchicago commented 9 years ago

@manglano to answer your question more directly, I'm pretty sure that you always need to center both sets of coordinates before doing the RMSD/rot calculation. The centering needs to be based on the set of coordinates you are passing, not on the center of the entire soup.

ghost commented 9 years ago

@bdallen-uchicago Aha! Okay, I think I see. The RMSD calculation occurs independently of the geometric stuff, but we should calculate & store the output of the geometric functions for each set of sequences to get the correct visualization for the best RMSD. Thanks!

@tabinks: Is the subset of a coordinate set functionally equivalent to the subset of a protein sequence?

tabinks commented 9 years ago

Ill post how to apply rotation later tonight.

tabinks commented 9 years ago

This is what I did, it may be different from what you did, but that doesn't mean yours is wrong. Without seeing what you did, I'm guessing you only transformed your coordinate array to the center of mass and forgot to also apply it to the soup.

# Identify the center of mass of the structure
center1 = v3.get_center(crds1)
center2 = v3.get_center(crds2)

# Transform the structure to its own center of mass
soup1.transform(v3.translation(-center1))
soup2.transform(v3.translation(-center2))

# Calculate RMSD
calculated_rmsd, transform_1_to_2 = rmsd.calc_rmsd_rot(crds1, crds2)
soup1.transform(transform_1_to_2)

# Write out a PDB file of the mobile structure
soup1.write_pdb(transformed_pdb1)

ghost commented 9 years ago

@bdallen-uchicago @tabinks Okay, so my global alignment tries all the subsequences with equal length, centers the coordinate sets, and then aligns the coordinates with 'rmsd.calc_rmsd_rot(crds1copy, crds2copy)', and then dumps all the information into a tuple to find the minimum RMSD and do a translation on the appropriate pdb for the centering coordinates and the rotation matrix.

Buuuut now my minimum RMSD is ~23.8. And, the center of mass for a subset of the coordinates isn't necessarily the center of mass for the whole protein. I'm not sure how I got here, nor of what I need to fix. I'm going to keep implementing so I have some functional solution.

bdallen-uchicago commented 9 years ago

@manglano 23-24 is the RMSD I was getting when the coords I passed to rmsd.calc_rmsd_rot weren't centered, but there could be other causes. Notice that the code posted by @tabinks which is similar to rmsd_of_soups also calculates the center of just the coords in the alignment.

bdallen-uchicago commented 9 years ago

This is how I do it, using the center_vlist I posted above:

# for each pair of contiguous subsets of coords (crds1, crds2)
center1, centered_crds1 = center_vlist(crds1)
center2, centered_crds2 = center_vlist(crds2)
score, rot_matrix = rmsd.calc_rmsd_rot(centered_crds1, centered_crds2)
# compare with current best score, if this is better,
# save score, rot_matrix, center1, center2, offsets, etc
# as the new best (I defined a class for storing these)

This was simpler for me because it made it clearer what was going on with the centering and transformations, and allows me to try lots of different methods for selecting the coords before messing with the soup, but there are other ways as @tabinks said.

I also renamed crds to coords because I keep typing it as cdrs, drives me insane. Tried to use crds here to be consistent, and realized I did it again and had to edit it :P.

bdallen-uchicago commented 9 years ago

Here's the code I use to transform soup1:

    soup1x = self.soup1.copy()
    soup1x.transform(v3.translation(-self.best.center1))
    soup1x.transform(self.best.rot_matrix)
    soup1x.transform(v3.translation(self.best.center2))

self.best is the object where I save all the info related to the best alignment. I have a similar method for transforming soup2 that inverts the rot matrix (EDIT: I did this just for fun/make sure I understand, transforming both of them in this way would mess things up, since I'm already translating soup1 to the center of soup2 here).

ghost commented 9 years ago

Got it. I think I wasn't centering the coordinate matrix before running the RMSD; this passes the test case now. Thank you very much for your help. I think I can map this onto the local alignments and the rest of the homework without too much difficulty but I'll have to puzzle out the visualizations.

ghost commented 9 years ago

Okay. My local alignment works, my global alignment works. I get boundary indices for the local and global alignment residues. The center for each subset of coordinates gets saved to a tuple along with the RMSD so I get the center of mass for each subset. I apply these centers of mass to the soup at the end, saying: 1) move the soup to the center of mass for the subset of its own residues giving the best RMSD 2) transform the soup by the rotation matrix 3) move the soup to the center of mass of the other soup 4) output file.

Then, I get something rotated correctly and very nearly aligned, as though a translation in a line would get all the helices overlapping:

So, I think all the math is correct but something about the transforms isn't. How do I get an awesome superposition like Andrew?

(I'll probably need to submit another set of revisions for this assignment (;___;):ok_hand: )

tabinks commented 9 years ago

I'll spend some time reviewing structure alignment tonight at the beginning of class. Everyone will have a chance to resubmit their work.

uchicago-bio / 2015-Autumn-Forum

You Can't Unsee These Homework Tips #63

Question1 Answers