markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0

how to ? #1352

Closed naveenmeena584 closed 6 years ago

naveenmeena584 commented 6 years ago

Thanks

cwehmeyer commented 6 years ago

Let's look at the tutorial notebook you mentioned.

The call

pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')

ensures that the file alanine-dipeptide-nowater.pdb exists in the directory data and returns the relative path as a string. If you know that the file already exists, you could also (on a Linux/Unix/OSX system) write

pdb = 'data/alanine-dipeptide-nowater.pdb'

Likewise,

files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.xtc', working_directory='data')

would be equivalent to

files = [
    'data/alanine-dipeptide-0-250ns-nowater.xtc',
    'data/alanine-dipeptide-1-250ns-nowater.xtc',
    'data/alanine-dipeptide-2-250ns-nowater.xtc']

And that is exactly the kind of information you need to pass to pyemma's loading functions: the relative or absolute paths of your files as strings.
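For example, a minimal sketch of building such a list for your own data (the file names and the glob pattern here are hypothetical; adjust them to your naming scheme):

import glob

# Hypothetical local files; no download needed if they already exist.
pdb = 'data/my-topology.pdb'
files = sorted(glob.glob('data/my-trajectory-*.xtc'))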

Once you have the location of your PDB file stored in the variable pdb and the location of one or more trajectories in the variable files, you can create a featurizer

feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(periodic=False) # load only backbone torsions

and load the selected molecular features into memory

data = pyemma.coordinates.load(files, features=feat)

or create a reader object (recommended for huge data sets)

reader = pyemma.coordinates.source(files, features=feat)
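Such a reader can be handed directly to downstream estimators, which then stream the data in chunks instead of loading everything into memory at once; a minimal sketch (the lag time of 10 steps is an arbitrary assumption):

# The reader streams the data chunk-wise, e.g., into a TICA estimation:
tica = pyemma.coordinates.tica(reader, lag=10)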
naveenmeena584 commented 6 years ago

It's working, but I got another error. I am following this tutorial: https://github.com/markovmodel/pyemma_tutorials/blob/master/notebooks/01-data-io-and-featurization.ipynb, and at this step

data_concatenated = np.concatenate(data)
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat);

I am getting the following error:

IndexError                                Traceback (most recent call last)
<ipython-input> in <module>()
      1 data_concatenated = np.concatenate(data)
----> 2 pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat);

~/miniconda3/envs/lib/python3.6/site-packages/pyemma/plots/plots1d.py in plot_feature_histograms(xyzall, feature_labels, ax, ylog, outfile, n_bins, ignore_dim_warning, **kwargs)
     64         raise ValueError('Input data hast to be a numpy array. Did you concatenate your data?')
     65
---> 66     if xyzall.shape[1] > 50 and not ignore_dim_warning:
     67         raise RuntimeError('This function is only useful for less than 50 dimensions. Turn-off this warning '
     68                            'at your own risk with ignore_dim_warning=True.')

IndexError: tuple index out of range
cwehmeyer commented 6 years ago

Yes, that exception is raised if you want to plot the histograms of more than 50 features. You can either plot your features in batches, e.g., via

pyemma.plots.plot_feature_histograms(data_concatenated[:, 0:10])
pyemma.plots.plot_feature_histograms(data_concatenated[:, 10:20])
...

or use the option mentioned in the Traceback to suppress the exception:

pyemma.plots.plot_feature_histograms(
    data_concatenated, feature_labels=feat, ignore_dim_warning=True)

The latter, however, will most likely result in a completely unusable figure.
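If you really need all the histograms, a minimal sketch of the batched variant from above (assuming data_concatenated is defined and 10 features per figure is acceptable):

n_features = data_concatenated.shape[1]
for start in range(0, n_features, 10):
    # One figure with at most 10 feature histograms per iteration.
    pyemma.plots.plot_feature_histograms(data_concatenated[:, start:start + 10])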

thempel commented 6 years ago

Actually, can you show us what data_concatenated.shape returns? I suspect that this array is not set up correctly.

cwehmeyer commented 6 years ago

Yes, you are right @thempel, I misread the Traceback.

naveenmeena584 commented 6 years ago

type of data: <class 'numpy.ndarray'>
lengths: 250000
shape of elements: (2,)

naveenmeena584 commented 6 years ago

alanine-dipeptide-0-250ns-nowater.xtc and alanine-dipeptide-nowater.pdb

thempel commented 6 years ago

Thanks; unfortunately, we are still having trouble following you. Could you please provide the code that you are trying to run? A minimal example would be great so we can reproduce the issue.

If you have only a single trajectory, you should not concatenate the data; in that case, try using the original data instead of the concatenated data. Concatenation only makes sense if you have multiple trajectories that you need to combine, e.g., for histogram plotting.
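For example, a minimal sketch that handles both cases (assuming data comes from pyemma.coordinates.load as above):

import numpy as np

if isinstance(data, np.ndarray):
    # Single trajectory: load() already returns a single 2D array.
    data_concatenated = data
else:
    # Multiple trajectories: load() returns a list of 2D arrays.
    data_concatenated = np.concatenate(data)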

naveenmeena584 commented 6 years ago

Actually, I want to analyze my own simulation files with this Jupyter notebook: https://github.com/markovmodel/deeptime/blob/master/vampnet/examples/Alanine_dipeptide_multiple_files.ipynb. I am confused about this part:

Download alanine coordinates and dihedral angles data

mdshare.load('alanine-dipeptide-3x250ns-heavy-atom-positions.npz')
mdshare.load('alanine-dipeptide-3x250ns-backbone-dihedrals.npz')
alanine_files = np.load('alanine-dipeptide-3x250ns-heavy-atom-positions.npz')

How did you get the two files with heavy atom positions and backbone dihedrals in .npz format? Is it necessary to use three files? I currently have one .xtc and one .pdb file, so how can I get an .npz file and use this code?

cwehmeyer commented 6 years ago

OK, a few points on this:

mdshare.load('alanine-dipeptide-3x250ns-heavy-atom-positions.npz')
mdshare.load('alanine-dipeptide-3x250ns-backbone-dihedrals.npz')

mdshare.load() is deprecated and will not work if you have the latest version of mdshare. Please use mdshare.fetch() instead.
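For example, the deprecated calls above would become (working_directory='data' mirrors the tutorial and is optional):

mdshare.fetch('alanine-dipeptide-3x250ns-heavy-atom-positions.npz', working_directory='data')
mdshare.fetch('alanine-dipeptide-3x250ns-backbone-dihedrals.npz', working_directory='data')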

How did you get the two files with heavy atom positions and backbone dihedrals in .npz format? Is it necessary to use three files? I currently have one .xtc and one .pdb file, so how can I get an .npz file and use this code?

The functions pyemma.coordinates.featurizer() and pyemma.coordinates.load() are used to extract molecular features (e.g., backbone dihedrals or heavy atom positions) from files which are stored in one of the usual molecular dynamics formats (e.g., .xtc or .dcd).

In the vampnet example you mentioned, we are using precomputed molecular features. In detail, we have run the code

feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(periodic=False)
data = pyemma.coordinates.load(files, features=feat)
np.savez('alanine-dipeptide-3x250ns-backbone-dihedrals.npz', *data)

to extract the backbone dihedrals from the three .xtc files and saved the resulting three numpy.ndarrays in the file alanine-dipeptide-3x250ns-backbone-dihedrals.npz.

Now, if we want to run a vampnet calculation using backbone dihedrals, we can load this precomputed data via

with np.load('alanine-dipeptide-3x250ns-backbone-dihedrals.npz') as fh:
    data = [fh['arr_0'], fh['arr_1'], fh['arr_2']]

Unfortunately, pyemma cannot directly read .npz or .npy files and, thus, we use numpy to load the data into memory; this is explained in https://github.com/markovmodel/pyemma_tutorials/blob/master/notebooks/01-data-io-and-featurization.ipynb, Case 1: preprocessed data (toy model).

naveenmeena584 commented 6 years ago

I got that, but you have not cleared up another doubt: you used three .xtc files, and because of that there are three .npy files to use in this code:

Save the files separately

np.save('traj0.npy', alanine_files['arr_0'])
np.save('traj1.npy', alanine_files['arr_1'])
np.save('traj2.npy', alanine_files['arr_2'])

Separate data files between training data and validation data

train_data_files_list = [
    'traj0.npy',
    'traj1.npy',
]

valid_data_files_list = [
    'traj2.npy',
]

My doubt is: if I have only one .npy file, how do I define the training data and the validation data? And if I have more than three .npy files, is it necessary to use exactly three? Another doubt: can I use a .gro file as the topology file instead of a .pdb file?

thempel commented 6 years ago

Yes, you can use .gro files as topology files.

The number of files is arbitrary, you can structure the data as you like. The crucial part is that you subsample your data such that there is no overlap between training and validation data. In the above case, we had 3 independent trajectories and chose the first two for training and the third for validation. If you have multiple trajectories, you can take an arbitrary subset for training and the remainder for validation. If you have only a single trajectory, you need to subsample this trajectory into blocks.

Generally, this split does not require the data to be in different files. More information on this kind of splitting is provided in introductions about cross-validation. This should be explained in the PyEMMA tutorials that you already mentioned (notebook 00 and 01). If you have further issues with VAMPNets in particular, please consider opening an issue in the deeptime repository.
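As a minimal sketch of such a split over an arbitrary number of trajectory files (the 80/20 ratio and the file list are assumptions for illustration):

import random

all_files = ['traj0.npy', 'traj1.npy', 'traj2.npy']  # your own .npy files
random.seed(42)  # fix the seed to make the split reproducible
random.shuffle(all_files)
n_train = max(1, int(0.8 * len(all_files)))
train_data_files_list = all_files[:n_train]
valid_data_files_list = all_files[n_train:]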

naveenmeena584 commented 6 years ago

Thanks a lot, I understand everything; I only have one last doubt. As you mentioned above: "If you have only a single trajectory, you need to subsample this trajectory into blocks." How do I do this? Do you have an example where you did that for a single trajectory?

cwehmeyer commented 6 years ago

Let us assume your single trajectory is loaded into the variable data. Then, running

n = len(data) // 2
data_train = data[:n]
data_validation = data[n:]

would split your trajectory into two roughly equal-sized, non-overlapping parts. This is a crude but simple example.

If you want a more elaborate example, please consider working through this block subsampling function from deeptime's time-lagged autoencoder project: https://github.com/markovmodel/deeptime/blob/f2b97328baa1c38c92616f058195fa5803ff05d9/time-lagged-autoencoder/tae/utils.py#L190-L211
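In the same spirit, a minimal sketch of a block split for a single trajectory already loaded into data (the block size of 10000 frames is an arbitrary assumption; the linked deeptime utility is more careful):

block_size = 10000
# Cut the trajectory into non-overlapping blocks of equal length.
n_blocks = len(data) // block_size
blocks = [data[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
# Assign alternating blocks to training and validation.
data_train = blocks[0::2]
data_validation = blocks[1::2]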

naveenmeena584 commented 5 years ago

how to resolve this?

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      5         output_size),
      6     steps = np.sum(np.ceil((total_data_source.trajectory_lengths()-tau)/batch_size)),
----> 7     verbose = 0)
      8 states_prob_t = states_prob_all[:,:output_size]
      9 states_prob_lag = states_prob_all[:,output_size:]

/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in predict_generator(self, generator, steps, max_queue_size, workers, use_multiprocessing, verbose)
   1534         workers=workers,
   1535         use_multiprocessing=use_multiprocessing,
-> 1536         verbose=verbose)
   1537
   1538     def _get_callback_model(self):

/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training_generator.py in model_iteration(model, data, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch, mode, batch_size, **kwargs)
    174     progbar.on_epoch_begin(epoch, epoch_logs)
    175
--> 176     for step in range(steps_per_epoch):
    177       batch_data = _get_next_batch(output_generator, mode)
    178       if batch_data is None:

TypeError: 'numpy.float64' object cannot be interpreted as an integer
thempel commented 5 years ago

Thank you very much for posting your tensorflow problem to the pyemma issue tracker and for putting so much effort into formatting it. I assume you can resolve this by not using floats but integers when calling range().
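Concretely, a minimal sketch of that fix, using the names from the posted code: np.ceil() returns floats, so the steps argument ends up as numpy.float64 unless you cast it explicitly:

# predict_generator expects an integer number of steps; cast the float result.
steps = int(np.sum(np.ceil((total_data_source.trajectory_lengths() - tau) / batch_size)))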

Wencesgiovanni commented 1 year ago

What is the difference between a working directory and a path?

I am at a loss.

I am trying to fetch the file pH10-amber-R1-dry.xtc via the command

files = fetch( 'pH10-amber-R1-dry.xtc', working_directory='C:/Users/giova/data/')

but I keep on obtaining the following error message:

pH10-amber-R1-dry.xtc [no match in repository]

I assure you that the file pH10-amber-R1-dry.xtc does exist in the directory /data. Why do I get this message?

Thank you very much for your attentive reply!