markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0
311 stars 119 forks source link

Creating custom features #1512

Closed cgseitz closed 2 years ago

cgseitz commented 3 years ago

Hello,

I am trying to create three features, which I believe need to be created with the pyemma.coordinates.data.featurization.misc.GroupCOMFeature (but I could be wrong). I created center of mass a, b, c, and d. I want to measure the distance between group a and d, b and d, and c and d. Here is an example for group a and d:

feat_breathing_2009g = pyemma.coordinates.featurizer(top1)
from pyemma.coordinates.data.featurization.misc import GroupCOMFeature
COM_RBD1 = GroupCOMFeature(feat_breathing_2009g.topology, [[list(range(66,286))]])
COM_central_helices_top = GroupCOMFeature(feat_breathing_2009g.topology, [[419, 985, 1551]])
def feature_1(traj: md.Trajectory):
    centers1 = COM_RBD1.transform(traj)  # yields ndarray
    centers2 = COM_central_helices_top.transform(traj)  # yields ndarray
    xyz = np.hstack((centers1, centers2))
    traj = md.Trajectory(xyz.reshape(-1, 3, 3), topology=None)
    # this has shape (n_frames, 1)
    #distance1 = np.linalg.norm(centers1 - centers2)
    #distance1 = np.linalg.norm(traj)
    #distance1 = md.compute_distances(traj, atom_pairs=[[0, 1]])
    #distance1 = np.linalg.norm(np.subtract(centers1,centers2))
    return distance1

where the commented out distance1 are things I've tried. Then I load the trajectories with the features: data_2009_glycosylated = pyemma.coordinates.load(trajs1, features=feat_breathing_2009g) and each time I get the same error: TypeError: 1D weights expected when shapes of a and weights differ. Clearly, the error is some small syntax. Could you point out what's wrong? Thanks!

Best, Christian

clonker commented 3 years ago

Hi, could you send us a small test script and/or the full stacktrace? Thanks!

cgseitz commented 3 years ago

Sure, here is the full stacktrace. Let me know if you need anything else~

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-82-58a881b0ce7b> in <module>
      1 #lag time here doesn't really matter
      2 print("2009 glycosylated HA")
----> 3 data_2009_glycosylated = pyemma.coordinates.load(trajs1, features=feat_breathing_2009g)
      4 print('trajectory time step = .06 ns')
      5 print("data_2009_glycosylated has",len(data_2009_glycosylated), "trajectories")

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/api.py in load(trajfiles, features, top, stride, chunksize, **kw)
    242                  or len(trajfiles) is 0)):
    243         reader = create_file_reader(trajfiles, top, features, chunksize=cs, **kw)
--> 244         trajs = reader.get_output(stride=stride)
    245         if len(trajs) == 1:
    246             return trajs[0]

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/data/_base/datasource.py in get_output(self, dimensions, stride, skip, chunk)
    404             pg.register(it.n_chunks, description='getting output of %s' % self.__class__.__name__)
    405             with pg.context(), it:
--> 406                 for itraj, chunk in it:
    407                     i = slice(it.pos, it.pos + len(chunk))
    408                     assert i.stop - i.start > 0

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/data/_base/datasource.py in __next__(self)
   1047         self._select_file(self._itraj)
   1048         try:
-> 1049             X = self._use_cols(self._next_chunk())
   1050             self._t += len(X)
   1051         except StopIteration as e:

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/data/_base/datasource.py in _next_chunk(self)
   1169         # We discard the trajectory index here for transformation
   1170         if self.transform_function is not None:
-> 1171             x = self.transform_function(x)
   1172         return x
   1173 

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/data/feature_reader.py in transform(data)
    153                 return data
    154             else:
--> 155                 return self.featurizer.transform(data)
    156 
    157         it = FeatureReaderIterator(self, skip=skip, chunk=chunk, stride=stride, return_trajindex=return_trajindex,

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/data/featurization/featurizer.py in transform(self, traj)
    910                                         vec.shape[0]))
    911             else:
--> 912                 vec = f.transform(traj).astype(np.float32)
    913             feature_vec.append(vec)
    914 

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/data/featurization/misc.py in transform(self, traj)
    308             traj_copy = traj_copy.image_molecules()
    309         for aas, mms in zip(self.group_definitions, self.masses_in_groups):
--> 310             COM_xyz.append(np.average(traj_copy.xyz[:, aas, ], axis=1, weights=mms))
    311         return np.hstack(COM_xyz)
    312 

<__array_function__ internals> in average(*args, **kwargs)

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/numpy/lib/function_base.py in average(a, axis, weights, returned)
    395                     "differ.")
    396             if wgt.ndim != 1:
--> 397                 raise TypeError(
    398                     "1D weights expected when shapes of a and weights differ.")
    399             if wgt.shape[0] != a.shape[axis]:

TypeError: 1D weights expected when shapes of a and weights differ.
cgseitz commented 3 years ago

One more question, if I only wanted the center of masses of the alpha carbons of the selections, how would I incorporate that into a custom feature?

clonker commented 3 years ago

Having a closer look at your code, I think the problem is here:

COM_RBD1 = GroupCOMFeature(feat_breathing_2009g.topology, [[list(range(66,286))]])

It creates group definitions as a triply-nested list [[[66, 67, 68, ...]]] while the featurizer expects a list of groups, i.e., [[66, 67, 68, ...]]. I'm not sure I understand your second question properly, the GroupCOM feature already gives you the center of mass if you give it the indices of C alphas.

cgseitz commented 3 years ago

Hello,

When I change the code as you suggest

feat_breathing_2009g = pyemma.coordinates.featurizer(top1)

from pyemma.coordinates.data.featurization.misc import GroupCOMFeature
COM_RBD1 = GroupCOMFeature(feat_breathing_2009g.topology, [list(range(66,286))])
COM_central_helices_top = GroupCOMFeature(feat_breathing_2009g.topology, [[419, 985, 1551]])

def feature_1(traj: md.Trajectory):
    centers1 = COM_RBD1.transform(traj)  # yields ndarray
    centers2 = COM_central_helices_top.transform(traj)  # yields ndarray
    xyz = np.hstack((centers1, centers2))
    traj = md.Trajectory(xyz.reshape(-1, 3, 3), topology=None)
    # this has shape (n_frames, 1)
    distance1 = np.linalg.norm(np.subtract(centers1,centers2)) #does not return a numpy array
    return distance1

feat_breathing_2009g.add_custom_func(feature_1, dim=1, description='breathing')
data_2009_glycosylated = pyemma.coordinates.load(trajs1, features=feat_breathing_2009g)

The matrix does not reshape properly, into something that is appropriately divisible:

ValueError                                Traceback (most recent call last)
<ipython-input-14-58a881b0ce7b> in <module>
      1 #lag time here doesn't really matter
      2 print("2009 glycosylated HA")
----> 3 data_2009_glycosylated = pyemma.coordinates.load(trajs1, features=feat_breathing_2009g)
      4 print('trajectory time step = .06 ns')
      5 print("data_2009_glycosylated has",len(data_2009_glycosylated), "trajectories")

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/api.py in load(trajfiles, features, top, stride, chunksize, **kw)
    242                  or len(trajfiles) is 0)):
    243         reader = create_file_reader(trajfiles, top, features, chunksize=cs, **kw)
--> 244         trajs = reader.get_output(stride=stride)
    245         if len(trajs) == 1:
    246             return trajs[0]

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/data/_base/datasource.py in get_output(self, dimensions, stride, skip, chunk)
    404             pg.register(it.n_chunks, description='getting output of %s' % self.__class__.__name__)
    405             with pg.context(), it:
--> 406                 for itraj, chunk in it:
    407                     i = slice(it.pos, it.pos + len(chunk))
    408                     assert i.stop - i.start > 0

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/data/_base/datasource.py in __next__(self)
   1047         self._select_file(self._itraj)
   1048         try:
-> 1049             X = self._use_cols(self._next_chunk())
   1050             self._t += len(X)
   1051         except StopIteration as e:

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/data/_base/datasource.py in _next_chunk(self)
   1169         # We discard the trajectory index here for transformation
   1170         if self.transform_function is not None:
-> 1171             x = self.transform_function(x)
   1172         return x
   1173 

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/data/feature_reader.py in transform(data)
    153                 return data
    154             else:
--> 155                 return self.featurizer.transform(data)
    156 
    157         it = FeatureReaderIterator(self, skip=skip, chunk=chunk, stride=stride, return_trajindex=return_trajindex,

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/data/featurization/featurizer.py in transform(self, traj)
    890             if isinstance(f, CustomFeature):
    891                 # NOTE: casting=safe raises in numpy>=1.9
--> 892                 vec = f.transform(traj).astype(np.float32, casting='safe')
    893                 if vec.shape[0] == 0:
    894                     vec = np.empty((0, f.dimension))

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/data/featurization/misc.py in transform(self, traj)
    113 
    114     def transform(self, traj):
--> 115         feature = self._fun(traj, *self._args, **self._kwargs)
    116         if not isinstance(feature, np.ndarray):
    117             raise ValueError("your function should return a NumPy array!")

<ipython-input-13-2695c00a87f4> in feature_1(traj)
     13     centers2 = COM_central_helices_top.transform(traj)  # yields ndarray
     14     xyz = np.hstack((centers1, centers2))
---> 15     traj = md.Trajectory(xyz.reshape(-1, 3, 3), topology=None)
     16     # this has shape (n_frames, 1)
     17     #distance1 = np.linalg.norm(centers1 - centers2) #does not return a numpy array

ValueError: cannot reshape array of size 44178 into shape (3,3)

I am generating center of mass distances as 1D points; do I need to keep these 3D? For the other part, I guess I am not clear on whether the indices you give to the GroupCOMFeature will be read as residue indices or atom indices.

clonker commented 3 years ago

Regarding your stacktrace: It pretty much already says what is wrong with your code. You reshape

traj = md.Trajectory(xyz.reshape(-1, 3, 3), topology=None)

meaning that the coordinates are supposed to be of shape (time, 3 atoms, 3 spatial dimensions) which does not equally divide up the amount of data you got in xyz. Please check the outputs of your COM computations and reshape accordingly. That the distances are one-dimensional is not a problem although you might want to use some more features than just that, otherwise you are prone to large projection errors. Also in your implementation you evaluate

distance1 = np.linalg.norm(np.subtract(centers1,centers2)) #does not return a numpy array

which returns a single scalar. This is in all likelihood not what you want, rather you want to evaluate the COM distances in every frame. :slightly_smiling_face: In general I suggest running a debugger for these matters and checking the shape output step by step (or even good old print debugging).

cgseitz commented 3 years ago

Thanks! With your help I've made progress. I am trying to use the residue COM definition, where I can add the residue indices I need. However, the code syntax is not the same as the documentation:

COM_RBD2 = ResidueCOMFeature(feat_breathing_2009g.topology, residue_indices=np.array([[632,852]]), residue_atoms=np.array([[632,852]]), scheme='all'

TypeError                                 Traceback (most recent call last)
<ipython-input-77-0f7e165c1eaa> in <module>
     23 #COM_RBD2 = ResidueCOMFeature(feat_breathing_2009g.topology, [list(range(632,852))], scheme='all')
     24 #COM_RBD2 = ResidueCOMFeature(feat_breathing_2009g.topology, [[632, 852]], [[632,842]], scheme='all')
---> 25 COM_RBD2 = ResidueCOMFeature(feat_breathing_2009g.topology, residue_indices=np.array([[632,852]]), residue_atoms=np.array([[632,852]]), scheme='all')
     26 COM_RBD3 = ResidueCOMFeature(feat_breathing_2009g.topology, [list(range(1198,1418))])
     27 COM_central_helices_top = ResidueCOMFeature(feat_breathing_2009g.topology, [[419, 985, 1551]])

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/data/featurization/misc.py in __init__(self, topology, residue_indices, residue_atoms, scheme, ref_geom, image_molecules, mass_weighted)
    339         for ri in self.residue_indices:
    340             for coor in 'xyz':
--> 341                 self._describe.append('%s COM-%s (%s)' % (topology.residue(ri), coor, self.scheme))
    342                 # TODO consider including the ref_geom and image_molecule arsg here?
    343 

~/miniconda3/envs/pyemma/lib/python3.9/site-packages/mdtraj/core/topology.py in residue(self, index)
    808             The `index`-th residue in the topology.
    809         """
--> 810         return self._residues[index]
    811 
    812     @property

TypeError: only integer scalar arrays can be converted to a scalar index

What is the difference between residue_indices and residue_atoms, considering both are required arguments here? How do I appropriately call them? It appears only one is requested in the documentation. http://www.emma-project.org/latest/api/generated/pyemma.coordinates.featurizer.html#pyemma.coordinates.data.featurization.featurizer.MDFeaturizer.add_residue_COM

clonker commented 3 years ago

Hi, glad to hear it. The syntax is exactly as in the documentation :wink:

What you are trying to do is creating a new instance of ResidueCOMFeature directly, which indeed takes atom indices (of the atoms present in the residues). What you are linking though is the add_residue_COM method of the Featurizer class. Here indeed a ResidueCOMFeature instance is internally created, the atom indices are fetched automatically though. At the very top of the documentation website you linked is an example on how to create/use the featurizer.

cgseitz commented 3 years ago

Thanks for clarifying, it now makes sense how ResidueCOMFeature is a part of add_residue_COM. I can see how the ResidueCOMFeature may not be able to recognize atoms if the atom indices don't correspond to a residue I have also indexed. With this, I have created my features, many thanks! Have you thought about giving more examples for feature creation on the pyemma page, especially for custom features? I would certainly be willing to share the features I've made if it can help the community.

clonker commented 3 years ago

Great! I think examples are always good, I am looking forward to a PR!