markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0
306 stars 119 forks source link

Secondary structure in PyEMMA #1566

Closed rubinanoor9 closed 2 years ago

rubinanoor9 commented 2 years ago

Hello PyEMMA user! I have trajectories of protein simulation data, in which protein changes its secondary structure during simulation time. So I am trying to build MSM model based on dssp, and want to include complete protein and some residues that changes its secondary structure. For that I have tried to add features like this;

dssp_feature=pyemma.coordinates.data.CustomFeature(lambda x : md.compute_dssp(x).astype('float32')[:, None], dim=1) dssp_feat=pyemma.coordinates.featurizer(pdb) dssp_feat.add_custom_feature(dssp_feature) data = pyemma.coordinates.load(files, features=dssp_feat)

I got; ----> 1 dssp_feature=pyemma.coordinates.data.CustomFeature(lambda x : md.compute_dssp(x).astype('float32')[:, None], dim=1)

ValueError: could not convert string to float: 'C'

I am not be able to add features of dssp. It would be great pleasure for me, if you have a look and suggest me to resolve such issues. I am looking forward to hear from you Thanks in advance Regards Rubina

thempel commented 2 years ago

Hi Rubina, the issue here is that the output of the DSSP are strings that indicate the type of secondary structure (like H for helix or C for coil -- more details are here.) You are trying to type-cast that to a float with .astype('float32'), which doesn't work. If you only need to have information about helical vs. non-helical, you could do something like (mdtraj.compute_dssp(traj) == 'H').astype('float32')[:, None] (PLEASE TEST THIS, I DIDN'T USE IT!). If you want resolution over all secondary structure types, you need to map from the string output to a number (one could e.g. do this with a dictionary, but I'm sure that's very inefficient).

rubinanoor9 commented 2 years ago

Thank you so much for your kind support. Features has been added as per your suggestion. But by this "data = reader.get_output()", I get error as mentioned below. This is my input command line; dssp_feat=pyemma.coordinates.featurizer(pdb) dssp_feature=pyemma.coordinates.data.CustomFeature(lambda traj : (md.compute_dssp(traj) == 'H').astype('float32')[:, None], dim=3) dssp_feat.add_custom_feature(dssp_feature) reader = coor.source(files, features=dssp_feat) data = reader.get_output()

I got following error: Your 'CustomFeature[0][0] calling <function at 0x7f153a416af0> with args {args}'] did not return a 2d array. Shape was (500, 1, 377)

(I am trying with one trajectory, which has 500 frames and 377 residues

If I remove lambda and use, dssp_feature=pyemma.coordinates.data.CustomFeature((md.compute_dssp(traj) == 'H').astype('float32')[:, None], dim=3)

I got error: 'numpy.ndarray' object is not callable

I will be waiting for your kind response. Thanks in advance

thempel commented 2 years ago

Sorry, I believe that my code had an unnecessary dimension expansion. Generally, it may be good to test the custom function that you're giving to the featurizer before applying it. I hope that this is what you wanted:

def custom_function(traj):
    """
    Computes helicity for each frame and each residue in a trajectory
    The dimension will be (n_frames, n_residues)
    """
    return (mdtraj.compute_dssp(traj) == 'H').astype('float32')

topology = mdtraj.load(pdb)
dssp_feature = pyemma.coordinates.data.CustomFeature(custom_function, dim=topology.top.n_residues)

dssp_feat = pyemma.coordinates.featurizer(topology)
dssp_feat.add_custom_feature(dssp_feature)
data = pyemma.coordinates.load(files, features=dssp_feat)

You can test the function e.g. by this code, maybe that makes it simpler to debug:

test_traj = mdtraj.load_xtc(files[0], top=pdb, stride=100)
custom_function(test_traj)
rubinanoor9 commented 2 years ago

Thank you so much for your co-operation. I have solved my problem as per your suggestion. Thanks once again.

thempel commented 2 years ago

Happy to help :)