markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0
311 stars 119 forks source link

Saving features in file #1354

Closed and-tos closed 6 years ago

and-tos commented 6 years ago

I am trying to save the feature list X in a file

feat = coor.featurizer(topfile)
n_res = 165 # number of residues
ben_ind = 209 # ligand residue index (with residues starting at 1)
ind_arr = np.zeros((n_res,2))
for i in range(n_res):
    ind_arr[i][0] = ben_ind-1
    ind_arr[i][1] = i
feat.add_residue_mindist(residue_pairs=ind_arr, scheme='closest-heavy', threshold=0.5)

# short version: X = coor.load(traj_list, feat)
inp = coor.source(traj_list, feat)
X = inp.get_output()

When running TICA and clustering with kmeans, the output looks like this. cluster To save/load the output of X, which I understand is a list of arrays I run the following:

np.savez('features', X)

#create a new list of arrays from 'features.npz' by adding each array to a list
X=[]
for i in np.load('features.npz')['arr_0']:
    X.append(i)

Which gives me this output:

cluster_npload

How do I save and load X correctly?

I am running pyemma 2.5.2 from conda.

thempel commented 6 years ago

I'd guess that X = np.load('features.npz')['arr_0'] would do the job. If it doesn't: What exactly are you plotting?

and-tos commented 6 years ago

This is the error I get when using X = np.load('features.npz')['arr_0'] and calling

tica_lag = 10  # tica lagtime

tica_obj = coor.tica(X, lag=tica_lag)
raise TypeError('Argument traj is not a trajectory - only float-arrays or list of float-arrays are allowed. Types is %s' % type(traj))
TypeError: Argument traj is not a trajectory - only float-arrays or list of float-arrays are allowed. Types is <class 'numpy.ndarray'>

This is what I plot:

tica_lag = 10  # tica lagtime
tica_obj = coor.tica(X, lag=tica_lag)
Y = tica_obj.get_output()  # get tica coordinates

n_clusters = 700
clustering = coor.cluster_kmeans(Y, k=n_clusters, max_iter=100, tolerance=1e-10, fixed_seed=True)
dtrajs = clustering.dtrajs  # get discrete trajectories

Dall = np.concatenate(dtrajs)
Xall = np.vstack(X)
distogram = np.zeros(n_clusters)
for i in range(n_clusters):
    Xsub = Xall[np.where(Dall==i)[0]]
    distogram[i] = Xsub.sum() / Xsub.shape[0]

histogram = np.bincount(np.concatenate(dtrajs), minlength=len(clustering.clustercenters));
plt.scatter(histogram, distogram)
plt.semilogx(); plt.xlabel('number of frames'); plt.ylabel('number of contacts')
save_figure('cluster.png')
thempel commented 6 years ago

Numpy converts everything to numpy arrays when saving it to disc, so once you load it, it's not a python list anymore. Have a look at type(X) before and after you save/loaded it. On the other hand, TICA expects a list of arrays or a single array (as the error message says or the documentation). So I have to correct my previous answer, most probably you need to load X with `X = list(np.load('features.npz')['arr_0'])'. This should convert the object that you have now into a list of numpy arrays.

You could debug this by comparing your X before and after you load it. The length of the list of your original X, so len(X), might correspond to the first dimension of your loaded X, so X.shape.

and-tos commented 6 years ago

This worked! Thanks a lot!