msultan / SML_CV

Using supervised machine learning to build collective variables for accelerated sampling
MIT License
27 stars 5 forks source link

How to obtain pkl files #3

Closed roadhouse23 closed 5 years ago

roadhouse23 commented 5 years ago

Hello, I succesfully executed the alanine example (01-svm_example.py). However, it's not clear to me how I can obtain the following pkl files from a general MD trajectory.

raw_features.pkl features.pkl feature_descriptor.pkl svm_model_2.pkl

Can you help me or can you indicate a step-by-step documentation to apply this very interesting procedure in a general case starting from a MD simulation trajectory?

Thank you very much

GG

msultan commented 5 years ago

Hi,

Take a look at the following script as a starting point

https://github.com/msultan/SML_CV/blob/master/alanine_example/train_data/feat.py

In the script you should swap out the trj_list line with the following code:

import mdtraj as md 
trj_list = [md.load("./traj1.xtc",top="prot.pdb"), md.load("./traj2.xtc",top="prot.pdb")]

where traj1.xtc is the trajectory in the starting state, and traj2.xtc is the trajectory in the end state, and what we want is a collective variable that will drive the transition between them.

The script will then dump out the dihedral features(raw_features.pkl), sin-cosine transformed dihedral features(features.pkl), and a description of the features (feature_descriptor.pkl).

The svm_models are generated in the ipython notebooks, and learn what combinition of features separate traj1.xtc from traj2.xtc. This then becomes the collective variable.

I can do a more complete tutorial if you need once I get back but that would likely be in the evening or tomorrow.

roadhouse23 commented 5 years ago

Thanks for the reply and for all your help.

After your suggestions I should be able to apply the procedure for a general MD trajectory.

All the best Gianvito

Il giorno gio 9 ago 2018 alle ore 17:21 msultan notifications@github.com ha scritto:

Hi,

Take a look at the following script as a starting point

https://github.com/msultan/SML_CV/blob/master/alanine_example/train_data/feat.py

In the script you should swap out the trj_list line with the following code:

import mdtraj as md trj_list = [md.load("./traj1.xtc",top="prot.pdb"), md.load("./traj2.xtc",top="prot.pdb")]

where traj1.xtc is the trajectory in the starting state, and traj2.xtc is the trajectory in the end state, and what we want is a collective variable that will drive the transition between them.

The script will then dump out the dihedral features(raw_features.pkl), sin-cosine transformed dihedral features(features.pkl), and a description of the features (feature_descriptor.pkl).

The svm_models are generated in the ipython notebooks, and learn what combinition of features separate traj1.xtc from traj2.xtc. This then becomes the collective variable.

I can do a more complete tutorial if you need once I get back but that would likely be in the evening or tomorrow.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/msultan/SML_CV/issues/3#issuecomment-411795962, or mute the thread https://github.com/notifications/unsubscribe-auth/AoQpz14-SjVaATpnKfOlhPCpOXqQM13-ks5uPFOLgaJpZM4V102R .

msultan commented 5 years ago

Also, you can try different featurizers instead of plain back bone dihedrals or even combinations of backbone, alpha carbon sidechain dihedrals and contacts of various kinds. The starting feature space is really system dependent but dihedrals + side chain dihedrals and contacts seem to work for most systems that i have seen.

However if you combine features like that, its best to normalize the features and the code should work with that too. However, I haven't documented that properly so I will try and writing a tutorial for it. For now, i would recommend starting with the MSMBuilder DihedralFeaturizer and seeing how the results look.

roadhouse23 commented 5 years ago

I don't know if the standard CVs are enough in my case, considering that I'm working with a relatively big protein. However, I just trying to familizrize myself with your code before starting the production experiments.

It seems that feat.py has produced the pkl files needed for the calculation, however when I try to run the code (01-svm.py), I obtain the following error:

"Populating the interactive namespace from numpy and matplotlib LinearSVC(C=1, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l1', random_state=None, tol=0.0001, verbose=0) /opt/python/3/lib/python3.6/site-packages/numpy/core/_methods.py:135: RuntimeWarning: Degrees of freedom <= 0 for slice keepdims=keepdims) /opt/python/3/lib/python3.6/site-packages/numpy/core/_methods.py:105: RuntimeWarning: invalid value encountered in true_divide arrmean, rcount, out=arrmean, casting='unsafe', subok=False) /opt/python/3/lib/python3.6/site-packages/numpy/core/_methods.py:127: RuntimeWarning: invalid value encountered in true_divide ret = ret.dtype.type(ret / rcount)

ValueError Traceback (most recent call last) /media/gianvito/ongoing/CV/01jd/01-svm.py in () 104 for j in range(ny): 105 X_val = np.array([np.sin(xv[i,j]),np.cos(xv[i,j]), np.sin(yv[i,j]), np.cos(yv[i,j])]).reshape(1,-1) --> 106 res.extend(clf.predict(X_val)) 107 #contourf(lim_x,lim_y,np.array(res).reshape(10,10),cmap='coolwarm') 108

/opt/python/3/lib/python3.6/site-packages/sklearn/linear_model/base.py in predict(self, X) 322 Predicted class label per sample. 323 """ --> 324 scores = self.decision_function(X) 325 if len(scores.shape) == 1: 326 indices = (scores > 0).astype(np.int)

/opt/python/3/lib/python3.6/site-packages/sklearn/linear_model/base.py in decision_function(self, X) 303 if X.shape[1] != n_features: 304 raise ValueError("X has %d features per sample; expecting %d" --> 305 % (X.shape[1], n_features)) 306 307 scores = safe_sparsedot(X, self.coef.T,

ValueError: X has 4 features per sample; expecting 680"

msultan commented 5 years ago

X_val is only for plotting the dihedrals of alanine dipeptide. It wont apply to your system. Take a look at how I visualized the chignolin example to see what dihedrals it picks out as being important for the transition.

msultan commented 5 years ago

closing now, feel free to reopen this issue as needed and Good luck with the simulations!