custom feature without mdtraj analysis functions

germanbarcenas commented 3 years ago

Hello,

Sorry if this is a repeat of similar questions. I notice when people make custom functions, it tends to always end up using some mdtraj analysis with some tweaks, like maybe the distance of center of mass that end up combining existing analysis into something new. I want to make something that is completely new to use in some pyemma code. I can call this value an orientation factor that tells us how molecules will align with respect to another. It involves me selecting some atoms to define vectors that will tell me about the molecules alignment. Here is an example of how that would look like:

topology = mdtraj.load_frame(XTC,0,top=PDB).topology

r1=topology.select("resid %i and name'C9'"%dyeIndex[0])
s1=topology.select("resid %i and name'C20'"%dyeIndex[0])
r2=topology.select("resid %i and name'C9'"%dyeIndex[1])
s2=topology.select("resid %i and name'C20'"%dyeIndex[1])

Here I've selected the atoms named C9 and C20 of my residues of interest to create the vectors. This works fine. Now this is how the analysis would look like:

def kappa(traj: mdtraj.Trajectory):
    Long1 = np.subtract(r1,s1)
    UnitLong1 = Long1/np.linalg.norm(Long1)
    Long2 = np.subtract(r2,s2)
    UnitLong2 = Long2/np.linalg.norm(Long2)

    CoM1 = np.add(r1,s1)/2
    CoM2 = np.add(r2,s2)/2

    MagCenterToCenter = np.linalg.norm(np.subtract(CoM2,CoM1))
    UnitCenterToCenter = np.subtract(CoM2,CoM1)/MagCenterToCenter

    kappa = np.dot(UnitLong1,UnitLong2) - 3*(np.dot(UnitLong1,UnitCenterToCenter)*(np.dot(UnitCenterToCenter,UnitLong2)))
   return kappa

This works well in other instances where I might just append these values to an array that looks through the trajectory for each frame. I want to return this as a feature like this:

feat.add_custom_func(kappa,dim=1,description='Orientation Factor')

My question is, how do I return kappa as the ndarray that is needed for pyemma to accept this as a feature? I notice that custom features always have some moment where it invokes the trajectory to make an array of values for each frame. I think me I should be doing this to grab that r1,r2,s1,s2 values. My first thought is to use the traj.xyz function from mdtraj, and select those using traj.xyz[:,r1,:] functions, but I'm not really sure how I would use them after. I'm I on the right track? I would image the shape of the kappa feature should be (-1,1) right, because I should just be getting 1 kappa for each frame. So my first pass at this issue of returning an ndarray looks like this:

def kappa(traj: mdtraj.Trajectory):

    kappas=np.zeros((traj.n_frames,1))
    Long1 = np.subtract(r1,s1)
    UnitLong1 = Long1/np.linalg.norm(Long1)
    Long2 = np.subtract(r2,s2)
    UnitLong2 = Long2/np.linalg.norm(Long2)

    CoM1 = np.add(r1,s1)/2
    CoM2 = np.add(r2,s2)/2

    MagCenterToCenter = np.linalg.norm(np.subtract(CoM2,CoM1))
    UnitCenterToCenter = np.subtract(CoM2,CoM1)/MagCenterToCenter

    kappa = np.dot(UnitLong1,UnitLong2) - 3*(np.dot(UnitLong1,UnitCenterToCenter)*(np.dot(UnitCenterToCenter,UnitLong2)))
    kappas.append(kappa)
    return kappas

and here is the error I get:

Traceback (most recent call last):

  File "R:\GermanBarcenas\dna\gromacs\codes\markovAnalysis.py", line 933, in <module>
    tICAtraj=pyemma.coordinates.load(XTC,features=tICAfeat)

  File "C:\Users\GermanBarcenas\Anaconda3\envs\moldym\lib\site-packages\pyemma\coordinates\api.py", line 244, in load
    trajs = reader.get_output(stride=stride)

  File "C:\Users\GermanBarcenas\Anaconda3\envs\moldym\lib\site-packages\pyemma\coordinates\data\_base\datasource.py", line 406, in get_output
    for itraj, chunk in it:

  File "C:\Users\GermanBarcenas\Anaconda3\envs\moldym\lib\site-packages\pyemma\coordinates\data\_base\datasource.py", line 1049, in __next__
    X = self._use_cols(self._next_chunk())

  File "C:\Users\GermanBarcenas\Anaconda3\envs\moldym\lib\site-packages\pyemma\coordinates\data\_base\datasource.py", line 1171, in _next_chunk
    x = self.transform_function(x)

  File "C:\Users\GermanBarcenas\Anaconda3\envs\moldym\lib\site-packages\pyemma\coordinates\data\feature_reader.py", line 155, in transform
    return self.featurizer.transform(data)

  File "C:\Users\GermanBarcenas\Anaconda3\envs\moldym\lib\site-packages\pyemma\coordinates\data\featurization\featurizer.py", line 906, in transform
    vec = f.transform(traj).astype(np.float32, casting='safe')

  File "C:\Users\GermanBarcenas\Anaconda3\envs\moldym\lib\site-packages\pyemma\coordinates\data\featurization\misc.py", line 115, in transform
    feature = self._fun(traj, *self._args, **self._kwargs)

  File "R:\GermanBarcenas\dna\gromacs\codes\markovAnalysis.py", line 79, in kappa
    kappas.append(kappas,kappa,axis=-1)

AttributeError: 'numpy.ndarray' object has no attribute 'append'

Have I even started this approach correctly?

Thanks,

pyemma v 2.5.9

clonker commented 3 years ago

Hi,

in principle there is nothing wrong with your approach. Some hints: As per error message, ndarray (what you return in your computation) does not have an append. Instead, you will want to use a sliced assign like kappas[:] = kappa. Also your idea of the shape being (-1, 1) looks correct to me.

In general I would say this is almost a functioning feature. :slightly_smiling_face:

Best Moritz

germanbarcenas commented 3 years ago

Thank you @clonker. That slice notation was helpful. I am very close, but now I'm dealing with a problem I wonder you might provide 1 more hint to solve (:

So here are my changes:

#%%
t=mdtraj.load(XTC,top=PDB)

r1t=t.xyz[:,r1,:]
s1t=t.xyz[:,s1,:]
r2t=t.xyz[:,r2,:]
s2t=t.xyz[:,s2,:]
kappas=np.zeros((t.n_frames,1))
Long1 = np.subtract(r1t,s1t)
UnitLong1 = Long1/np.linalg.norm(Long1)
UnitLong1Squeeze=np.squeeze(UnitLong1)
Long2 = np.subtract(r2t,s2t)
UnitLong2 = Long2/np.linalg.norm(Long2)
UnitLong2Squeeze=np.squeeze(UnitLong2)

CoM1 = np.add(r1t,s1t)/2
CoM2 = np.add(r2t,s2t)/2

MagCenterToCenter = np.linalg.norm(np.subtract(CoM2,CoM1))
UnitCenterToCenter = np.subtract(CoM2,CoM1)/MagCenterToCenter
UnitCenterToCenterSqueeze=np.squeeze(UnitCenterToCenter)

u1dotu2=np.dot(UnitLong1Squeeze,UnitLong2Squeeze.T)
u1dotd=np.dot(UnitLong1Squeeze,UnitCenterToCenterSqueeze.T)
u2dotd=np.dot(UnitLong1Squeeze,UnitCenterToCenterSqueeze.T)
k=u1dotu2-3*(u1dotd*u2dotd)

kappas[:]=k

and from here I get the following error. In fact its the 1st time I've ever seen this one:

Traceback (most recent call last):

  File "R:\GermanBarcenas\dna\gromacs\codes\markovAnalysis.py", line 950, in <module>
    u2dotd=np.dot(UnitLong1Squeeze,UnitCenterToCenterSqueeze.T)

  File "<__array_function__ internals>", line 5, in dot

MemoryError: Unable to allocate 21.6 GiB for an array with shape (76134, 76134) and data type float32

Which tells me that these dot products are not eating up all of my memory. You might notice my transpose I added to those vectors. I did this because I was getting this before:

Traceback (most recent call last):

  File "R:\GermanBarcenas\dna\gromacs\codes\markovAnalysis.py", line 951, in <module>
    k = np.dot(UnitLong1Squeeze,UnitLong2Squeeze) - 3*(np.dot(UnitLong1Squeeze,UnitCenterToCenterSqueeze)*(np.dot(UnitCenterToCenterSqueeze,UnitLong2Squeeze)))

  File "<__array_function__ internals>", line 5, in dot

ValueError: shapes (76134,3) and (76134,3) not aligned: 3 (dim 1) != 76134 (dim 0)

So now my problem is dealing with the large arrays that are created when processing the whole trajectory. Are there any more efficient strategies in either pyemma or mdtraj to work through this problem more efficiently? There must be because dihedral angles also have to be doing some vector manipulation, likely with dot products, and those are features I have used that don't crash memory.

germanbarcenas commented 3 years ago

I realize that I wasn't using arrays at all correctly. I have fixed it and now this returns the correct feature. Normally I would just edit out the mistake but I'll leave it up for people to learn from my mistakes (:

def kappa(traj: mdtraj.Trajectory):

    r1t=t.xyz[:,r1,:]
    s1t=t.xyz[:,s1,:]
    r2t=t.xyz[:,r2,:]
    s2t=t.xyz[:,s2,:]
    kappas=np.zeros((t.n_frames,1))
    Long1=[]
    Long2=[]
    CoM1=[]
    CoM2=[]
    UnitLong1=[]
    UnitLong2=[]
    MagCenterToCenter=[]
    UnitCenterToCenter=[]
    k=[]
    for idx,val in enumerate(kappas):
        Long1.append(np.subtract(r1t[idx],s1t[idx]))
        Long2.append(np.subtract(r2t[idx],s2t[idx]))
        UnitLong1.append(Long1[idx]/np.linalg.norm(Long1[idx]))
        UnitLong2.append(Long2[idx]/np.linalg.norm(Long2[idx]))
        CoM1.append(np.add(r1t[idx],s1t[idx])/2)
        CoM2.append( np.add(r2t[idx],s2t[idx])/2)
        MagCenterToCenter.append(np.linalg.norm(np.subtract(CoM2[idx],CoM1[idx])))
        UnitCenterToCenter.append(np.subtract(CoM2[idx],CoM1[idx])/MagCenterToCenter[idx])
        k.append(np.dot(UnitLong1[idx],UnitLong2[idx].T) - 3*(np.dot(UnitLong1[idx],UnitCenterToCenter[idx].T)*(np.dot(UnitCenterToCenter[idx],UnitLong2[idx].T))))

    orientationFactor=np.asarray(k).reshape(-1,1)
    return orientationFactor

clonker commented 3 years ago

thanks @germanbarcenas! i'll go ahead and close this issue then.

markovmodel / PyEMMA

custom feature without mdtraj analysis functions #1515