Let's look at the tutorial notebook you mentioned.
The call
pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')
ensures that the file alanine-dipeptide-nowater.pdb exists in the directory data and returns the relative path as a string. If you know that the file already exists, you could also (on a Linux/Unix/OSX system) write
pdb = 'data/alanine-dipeptide-nowater.pdb'
Likewise,
files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.xtc', working_directory='data')
would be equivalent to
files = [
'data/alanine-dipeptide-0-250ns-nowater.xtc',
'data/alanine-dipeptide-1-250ns-nowater.xtc',
'data/alanine-dipeptide-2-250ns-nowater.xtc']
And that is exactly the kind of information you need to pass to pyemma's loading functions: the relative or absolute paths of your files as strings.
Once you have the location of your PDB file stored in the variable pdb and the location of one or more trajectories in the variable files, you can create a featurizer
feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(periodic=False) # load only backbone torsions
and load the selected molecular features into memory
data = pyemma.coordinates.load(files, features=feat)
or create a reader object (recommended for huge data sets)
reader = pyemma.coordinates.source(files, features=feat)
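If you go the reader route, here is a minimal sketch of how you might consume it, assuming the reader API of recent pyemma versions, where readers expose get_output() and can be passed to downstream estimators:
reader = pyemma.coordinates.source(files, features=feat)
data = reader.get_output()                      # stream the features into memory in chunks
tica = pyemma.coordinates.tica(reader, lag=10)  # or feed the reader to an estimator directly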
It's working, but I got another error. I am following this tutorial: https://github.com/markovmodel/pyemma_tutorials/blob/master/notebooks/01-data-io-and-featurization.ipynb and at this step
data_concatenated = np.concatenate(data)
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat);
IndexError Traceback (most recent call last)
Yes, that exception is raised if you want to plot the histograms of more than 50 features. You can either plot your features in batches, e.g., via
pyemma.plots.plot_feature_histograms(data_concatenated[:, 0:10])
pyemma.plots.plot_feature_histograms(data_concatenated[:, 10:20])
...
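For instance, a minimal loop over batches of ten features (assuming data_concatenated is a two-dimensional ndarray) could look like this:
n_features = data_concatenated.shape[1]
for start in range(0, n_features, 10):
    # each call opens a separate figure for the next ten features
    pyemma.plots.plot_feature_histograms(data_concatenated[:, start:start + 10])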
or use the option mentioned in the Traceback to suppress the exception:
pyemma.plots.plot_feature_histograms(
data_concatenated, feature_labels=feat, ignore_dim_warning=True)
The latter, however, will most likely result in a completely unusable figure.
Actually, can you show us what data_concatenated.shape returns? I suspect that this array is not set up correctly.
Yes, you are right @thempel, I misread the Traceback.
type of data: <class 'numpy.ndarray'> lengths: 250000 shape of elements: (2,)
alanine-dipeptide-0-250ns-nowater.xtc and alanine-dipeptide-nowater.pdb
Thanks, but unfortunately we are still having trouble following you. Could you please provide the code that you are trying to run? A minimal example would be great so we can reproduce the issue.
If you have only a single trajectory, you should not concatenate the data; in that case, use the original data instead of the concatenated data. Concatenation only makes sense when you have multiple trajectories that need to be combined, e.g., for histogram plotting.
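A minimal check, assuming data is whatever pyemma.coordinates.load() returned, could look like this:
import numpy as np

if isinstance(data, list):
    # multiple trajectories: stack them along the time axis
    data_concatenated = np.concatenate(data)
else:
    # a single trajectory is already one 2D array; use it as-is
    data_concatenated = data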
Actually, I want to analyze my own simulation files with this Jupyter notebook: https://github.com/markovmodel/deeptime/blob/master/vampnet/examples/Alanine_dipeptide_multiple_files.ipynb. I am confused about this part:
mdshare.load('alanine-dipeptide-3x250ns-heavy-atom-positions.npz')
mdshare.load('alanine-dipeptide-3x250ns-backbone-dihedrals.npz')
alanine_files = np.load('alanine-dipeptide-3x250ns-heavy-atom-positions.npz')
How did you get these two files with heavy atom positions and backbone dihedrals in .npz format? Is it necessary to use three files? I currently have one .xtc and one .pdb file, so how can I get an .npz file and use this code?
OK, a few points on this:
mdshare.load('alanine-dipeptide-3x250ns-heavy-atom-positions.npz')
mdshare.load('alanine-dipeptide-3x250ns-backbone-dihedrals.npz')
mdshare.load() is deprecated and will not work if you have the latest version of mdshare. Please use mdshare.fetch() instead.
How did you get these two files with heavy atom positions and backbone dihedrals in .npz format? Is it necessary to use three files? I currently have one .xtc and one .pdb file, so how can I get an .npz file and use this code?
The functions pyemma.coordinates.featurizer() and pyemma.coordinates.load() are used to extract molecular features (e.g., backbone dihedrals or heavy atom positions) from files which are stored in one of the usual molecular dynamics formats (e.g., .xtc or .dcd).
In the vampnet example you mentioned, we are using precomputed molecular features. In detail, we have run the code
feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(periodic=False)
data = pyemma.coordinates.load(files, features=feat)
np.savez('alanine-dipeptide-3x250ns-backbone-dihedrals.npz', *data)
to extract the backbone dihedrals from the three .xtc files and saved the resulting three numpy.ndarrays in the file alanine-dipeptide-3x250ns-backbone-dihedrals.npz.
Now, if we want to run a vampnet calculation using backbone dihedrals, we can load this precomputed data via
with np.load('alanine-dipeptide-3x250ns-backbone-dihedrals.npz') as fh:
data = [fh['arr_0'], fh['arr_1'], fh['arr_2']]
Unfortunately, pyemma cannot directly read .npz or .npy files; thus, we use numpy to load the data into memory. This is explained in https://github.com/markovmodel/pyemma_tutorials/blob/master/notebooks/01-data-io-and-featurization.ipynb, Case 1: preprocessed data (toy model).
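If your archive contains a different number of trajectories, a sketch that loads however many arrays are present (using the standard .files attribute of numpy's NpzFile) would be:
with np.load('alanine-dipeptide-3x250ns-backbone-dihedrals.npz') as fh:
    # fh.files lists the array names, e.g. arr_0, arr_1, ...; lexicographic
    # sorting keeps them in order for up to ten arrays
    data = [fh[key] for key in sorted(fh.files)]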
I got that, but you have not cleared up my other doubt: you used three .xtc files, and because of that there are three .npy files to use in this code:
np.save('traj0.npy', alanine_files['arr_0'])
np.save('traj1.npy', alanine_files['arr_1'])
np.save('traj2.npy', alanine_files['arr_2'])
train_data_files_list = ['traj0.npy', 'traj1.npy']
valid_data_files_list = ['traj2.npy']
My doubt is: if I have only one .npy file, how do I define the training and validation data? And if I have more than three .npy files, is it necessary to use exactly three? Another doubt: can I use a .gro file as a topology file instead of a .pdb file?
Yes, you can use a .gro file as the topology file.
The number of files is arbitrary; you can structure the data as you like. The crucial part is that you subsample your data such that there is no overlap between training and validation data. In the above case, we had three independent trajectories and chose the first two for training and the third for validation. If you have multiple trajectories, you can take an arbitrary subset for training and the remainder for validation. If you have only a single trajectory, you need to subsample this trajectory into blocks.
Generally, this split does not require the data to be in different files. More information on this kind of splitting is provided in introductions about cross-validation. This should be explained in the PyEMMA tutorials that you already mentioned (notebook 00 and 01). If you have further issues with VAMPNets in particular, please consider opening an issue in the deeptime repository.
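To make the multiple-trajectory case concrete, here is a minimal sketch of a random train/validation split over a list of trajectory files (the file names and the 75/25 split are hypothetical):
import random

all_files = ['traj0.npy', 'traj1.npy', 'traj2.npy', 'traj3.npy']  # hypothetical file names
random.shuffle(all_files)
n_train = int(0.75 * len(all_files))
train_data_files_list = all_files[:n_train]
valid_data_files_list = all_files[n_train:]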
Thanks a lot, I understand everything now. I only have one last doubt about what you mentioned above: "If you have only a single trajectory, you need to subsample this trajectory into blocks." How do I do this? Do you have an example where you did that for a single trajectory?
Let us assume your single trajectory is loaded into the variable data. Then, running
n = len(data) // 2
data_train = data[:n]
data_validation = data[n:]
would split your trajectory into roughly equal sized parts which are not overlapping. This is a crude but simple example.
If you want a more elaborate example, please consider working through this block subsampling function from deeptime's time-lagged autoencoder project: https://github.com/markovmodel/deeptime/blob/f2b97328baa1c38c92616f058195fa5803ff05d9/time-lagged-autoencoder/tae/utils.py#L190-L211
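If you prefer a self-contained illustration instead, here is a minimal sketch of the same idea, assuming data is a single 2D ndarray. It splits the trajectory into contiguous blocks and assigns a random subset of the blocks to validation; this is not the deeptime implementation, just a sketch:
import numpy as np

def block_split(data, n_blocks=10, validation_fraction=0.3):
    # split into contiguous, non-overlapping blocks along the time axis
    blocks = np.array_split(data, n_blocks)
    # pick a random subset of the blocks for validation
    n_val = max(1, int(validation_fraction * n_blocks))
    val_idx = set(np.random.choice(n_blocks, size=n_val, replace=False))
    train = [b for i, b in enumerate(blocks) if i not in val_idx]
    val = [b for i, b in enumerate(blocks) if i in val_idx]
    return train, val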
How do I resolve this? TypeError Traceback (most recent call last)
Thank you very much for posting your tensorflow problem to the pyemma issue tracker and for putting so much effort into formatting it. I assume you can resolve this by using integers instead of floats when calling range().
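To illustrate (the variable names here are made up): range() raises a TypeError when it receives a float, so cast the argument to an integer first:
n_frames = 250000
batch_size = 64
n_batches = n_frames / batch_size   # true division yields a float
for i in range(int(n_batches)):     # range() only accepts integers
    pass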
What is the difference between a working directory and a path?
I am at a loss.
I am trying to fetch the file pH10-amber-R1-dry.xtc via the command
files = fetch('pH10-amber-R1-dry.xtc', working_directory='C:/Users/giova/data/')
but I keep on obtaining the following error message:
pH10-amber-R1-dry.xtc [no match in repository]
I assure you that the file pH10-amber-R1-dry.xtc does exist in the directory /data. Why do I get this message?
Thank you very much for your attentive reply!
Thanks