markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0
311 stars 119 forks source link

PyEMMA-Custom Feature #1467

Closed ghost closed 4 years ago

ghost commented 4 years ago

Hello,

I'm new to PyEMMA and I am trying to build an eRMSD MSM for an RNA system. (eRMSD is calculated using BARNABA - python package). This is the code I'm trying to use to define my custom feature:

`import pyemma import pyemma.coordinates as coor import glob import matplotlib.pyplot as plt import barnaba as bb import numpy as np

ref = 'reference.pdb' traj_list = glob.glob("trajectory-*.nc") top = 'topology.prmtop'

print (traj_list)

create reader

reader = coor.source(traj_list, top=top)

add feature ***

ermsd = coor.data.CustomFeature((bb.ermsd(ref,f,topology=top)[0] for f in traj_list), dim=1647) reader.featurizer.add_custom_feature(ermsd)

perform tica on feature space

tica = coor.tica(reader)`

It all seems to work fine until I try to perform tica. When I do, I get the error "TypeError: 'generator' object is not callable".

I think it might be the way I'm defining my custom feature, but I don't know how to fix it.

Your help is greatly apprecieated,

Tia

clonker commented 4 years ago

Hi Tia, the custom feature feature expects a function as argument into which MDTraj trajectory objects (frames) are passed. In that spirit can you try something like this?

ermsd = coor.data.CustomFeature(lambda traj: bb.ermsd_traj(ref, traj))

Not sure what the dim=1647 is doing there, though.

Cheers, Moritz

ghost commented 4 years ago

Hello,

I tried the suggestion with a few additions. Here is the code:

`import pyemma import glob import pyemma.coordinates as coor import numpy as np import barnaba as bb

ref = 'asl-reference.pdb' top = 'asl-g34.prmtop' trajs = glob.glob("asl-g34-*.nc")

print (trajs)

reader = coor.source(trajs, top=top)

ermsd = coor.data.CustomFeature(lambda traj: bb.ermsd(ref, trajs, topology=top), dim=1647) reader.featurizer.add_custom_feature(ermsd) data = reader.get_output()

tica = coor.tica(reader) ` It seems to work fine, as the trajectories are loaded in barnaba. However, I now get this error:

TypeError: Cannot cast array from dtype('float64') to dtype('float32') according to the rule 'safe'

PS: I added dim=1647 to the command because otherwise, I get this error

TypeError: init() missing 1 required positional argument: 'dim'

To calculate dimensions, I simply multiplied the number of atoms in my system by 3.

Thank you for your suggestions,

Tia

clonker commented 4 years ago

Hi, could you post the full stack trace? Thanks!

ghost commented 4 years ago

I'm not quite sure what that is, but here is a picture of my screen.

Screen Shot 2020-08-03 at 3 04 56 PM
clonker commented 4 years ago

Oh you showed exactly the right thing, what I meant with stack trace is what is written under Traceback in your screenshot. Can you try the following:

ermsd = coor.data.CustomFeature(lambda traj: bb.ermsd(ref, trajs, topology=top).astype(np.float32), dim=1647)
clonker commented 4 years ago

And regarding the dim: I am not sure that a dim=1647 is the right thing here. What you want to put there is the output dimension of the ermsd computation. I assume that it is a scalar quantity, i.e., 1? So when you compute bb.ermsd(ref, trajs, topology=top) I would expect a resulting shape of (length-of-trajectory, 1) or equivalent.

clonker commented 4 years ago

Argh and I just realized that the lambda definition is not correct like this. It should be

ermsd = coor.data.CustomFeature(lambda traj: bb.ermsd(ref, traj, topology=top).astype(np.float32), dim=1647)

(note the traj rather than trajs in the ermsd call)

clonker commented 4 years ago

So i fixed something up for you using the sample data of BARNABA:

import pyemma
import numpy as np
import barnaba as bb
import pyemma.coordinates as coor
import mdtraj

top = 'sample1.pdb'
traj = 'samples.xtc'

reader = coor.source([traj], top=top)
ref = mdtraj.load_frame(traj, 0, top=top)  # first frame of traj

def ermsd_feature(traj):
    result = bb.ermsd_traj(ref, traj)
    return result.astype(np.float32).reshape(-1, 1)
ermsd = coor.data.CustomFeature(ermsd_feature, dim=1)
reader.featurizer.add_custom_feature(ermsd)

data = reader.get_output()

This should run through without any errors. BUT since it seems like you want to apply TICA afterwards I am guessing you want to find out something about the kinetics of your trajectory. For this matter eRMSD is probably not enough as a feature, consider adding more. If not you will end up with a one-dimensional trajectory.

ghost commented 4 years ago

This works if I only load 1 trajectory, but I have 30 of them. I have tried using glob to load them all, but the ermsd calculation fails, with this error:

Screen Shot 2020-08-04 at 8 47 13 AM

The same thing happens when I try to load them as a list. How should I proceed?

clonker commented 4 years ago

That is probably because of how you pass the trajectories. My guess is that you pass a list of lists somehow, in which case pyemma will use a FragmentedTrajectoryReader. You will want to pass a list of file names instead.

clonker commented 4 years ago

So if you just took the code i posted above, make sure to use something like

trajs = glob.glob(...)
reader = coor.source(trajs, top=top)

and not coor.source([trajs], top=top).

ghost commented 4 years ago

Thank you so much :)

ghost commented 4 years ago

I have tried adding other features but it fails.

Screen Shot 2020-08-04 at 3 45 26 PM
clonker commented 4 years ago

That is because it didn't find any phi/psi backbone torsions in the PDB. You can pass a custom selection string to use a specific set of backbone angles that is not phi/psi. You can also try to instantiate a featurizer directly, as described here although it probably does (and should) not make a difference.

thempel commented 4 years ago

Hi Tia, you wrote you are working with RNA systems. Pyemma's featurizer is generally designed for protein systems, i.e. method names refer to protein definitions of e.g. what are backbone atoms and how to compute the torsion angles between them. So calling featurizer.add_backbone_torsions() won't work because it doesn't find atoms that fit this definition ('empty indices'). I think that going via custom feature functions as you did before is a good way to implement feature functions in this case.

ghost commented 4 years ago

I will try that, thanks

ghost commented 4 years ago

Hello, According to the previous comment, I decided to add another custom feature from barnaba, namely g-vectors. Unlike eRMSD that yields a 1-D array, the g-vector calculation yields a 4-D array with dimension (m,n,n,4), where m is the number of frames loaded, and n is the number of nucleotides and it looks like this:

Screen Shot 2020-08-05 at 9 57 19 PM

To insert this custom feature, I want to flatten the array to (m,nn4), so I used this definition for gvec feature:

Screen Shot 2020-08-05 at 10 46 08 PM

This is what happens when I load this feature in PyEMMA:

Screen Shot 2020-08-05 at 10 49 31 PM Screen Shot 2020-08-05 at 10 49 49 PM

I am not sure what the problem is.

Thank you.

clonker commented 4 years ago

That is because dump_gvec expects a file, not a trajectory python object. For this, try dump_gvec_traj. You can see this by looking at the stack trace, it says `.../barnaba/functions.py, line 168, in dump_gvec", and having a look at that particular file at that particular line reveals that it tries to load() a file instead of extracting g vectors from an object.

Also the reshape should be the other way around, i.e., .reshape(-1, n*n*4); this is because PyEMMA might load the trajectory chunk-wise but the latter dimension will always stay the same (just the length might change). Or even: traj.reshape(traj.xyz.shape[0], -1).

ghost commented 4 years ago

This is the error I get when I try the first suggestion:

Screen Shot 2020-08-06 at 7 10 14 AM

I'm not sure how to implement the second suggestion. This is how I did it:

Screen Shot 2020-08-06 at 7 22 27 AM
clonker commented 4 years ago

Even though this is not really a pyemma issue but rather a barnaba thing: dump_gvec_traj returns a tuple of two where the first element is matrix you are after and the second element is the list of residues. To make it work you have to first extract the matrix like result = bb.dump_gvec_traj(traj)[0]. The thing to remember with custom features is really that the input is a MDTraj trajectory object and the output is expected to be a NumPy array of shape (trajectory_length, feature_dimensions) and dtype np.float32. Beyond that you are free to do whatever with the data you got at hand and it should work with the pyemma pipeline. The second suggestion is just how to do the reshape differently, both implementations should do the same in principle. Anyhow the first implementation looks cleaner and also correct (up to extraction of the tuple element and an I think unneccessary np.asarray(...)) to me.

ghost commented 4 years ago

Thank you