Dimension in the feature selection

andresilvapimentel commented 3 years ago

I loaded only one trajectory, but it was requested more output dimensions (10) (used the default value for dim = 10) than dimension of input data (1) in the feature selection. Do I need to load many trajectories? Pyemma randomly split the list of independent trajectories into a training and a validation set, compute the VAMP2 score, and repeat this process several times. However, what happens if I load only one trajectory? Is this the issue for the following error message?

RuntimeError Traceback (most recent call last)

in () 36 fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True) 37 for ax, lag in zip(axes.flat, [5, 10, 20]): ---> 38 torsions_scores = score_cv(torsions_data, lag=lag, dim=dim) 39 scores = [torsions_scores.mean()] 40 errors = [torsions_scores.std()] 5 frames /usr/local/lib/python3.7/dist-packages/pyemma/coordinates/transform/vamp.py in _estimate(self, iterable, **kw) 603 if isinstance(self.dim, int) and not self.dim <= indim: 604 raise RuntimeError("requested more output dimensions (%i) than dimension" --> 605 " of input data (%i)" % (self.dim, indim)) 606 607 if self._logger_is_active(self._loglevel_DEBUG): RuntimeError: requested more output dimensions (10) than dimension of input data (1)

clonker commented 3 years ago

If your data is only one-dimensional then you can't really project it into 10 dimensions, this is basically what the error message says. That doesnt have to do with the number of trajectories but rather with the molecule you are working with. If it only has one torsion angle, then that is what you get by using only torsions in the featurizer.

andresilvapimentel commented 3 years ago

This is a large protein that I generate the xtc file with only the backbone chain. I removed water, cations, and the rest of the protein leaving only the backbone. Is it ok to do that? After running with dimension equal 1, it worked it out and I found that the VAMP2 score of the backbone atoms distances is larger than 1.5 for lag time 0.5 ns and the other features returned VAMP2 score equal to 1, which corresponds to the invariant measure or equilibrium. However, the VAMP-2 score versus numer of dimentions plot has not a curve choosing only one dimension (dim = 1). This does not make sense, right? Something is wrong in the trajectory or topology, right?

andresilvapimentel commented 3 years ago

I also tried the xtc file with the all atom configuration instead of trying the xtc file with only the backbone configuration. Doing this, pyemma only worked with dimension equal 1 (dim = 1). Do you know what is going on? If you want to check the xtc and pdb files, I can share with you in a google drive. Thank you for helping me.

thempel commented 3 years ago

If you have a large protein and you only get one single torsion angle, there could be something broken with the topology (i.e., the PDB file). Does it know the right atom types? How are you setting up the featurizer and what is the output of featurizer.describe()?

andresilvapimentel commented 3 years ago

I set up like this:

torsions_feat = pyemma.coordinates.featurizer('spa_no_ca.pdb') torsions_feat.add_backbone_torsions(cossin=True, periodic=False) torsions_data = pyemma.coordinates.load(files, features=torsions_feat) labels = ['backbone\ntorsions']

positions_feat = pyemma.coordinates.featurizer('spa_no_ca.pdb') positions_feat.add_selection(positions_feat.select_Backbone()) positions_data = pyemma.coordinates.load(files, features=positions_feat) labels += ['backbone atom\npositions']

distances_feat = pyemma.coordinates.featurizer('spa_no_ca.pdb') distances_feat.add_distances( distances_feat.pairs(distances_feat.select_Backbone(), excluded_neighbors=2), periodic=False) distances_data = pyemma.coordinates.load(files, features=distances_feat) labels += ['backbone atom\ndistances']

andresilvapimentel commented 3 years ago

We tested pyemma using different pdb and xtc file, e.g., the lisozyme.pdb and lisozyme.xtc from the gromacs tutorial. We got the same error message about the dimension. And, it only worked it out using dim = 1. It seems we are doing something wrong generating the lisozyme.pdb and lisozyme.xtc files, I think. Could you explain me how to generate these files, please?

thempel commented 3 years ago

Well, you can upload your PDB file here and I will take a look. [Edit: Preferably the one that you want to use in the end.]

andresilvapimentel commented 3 years ago

Here are two pdb files that I have been using in pyemma. spa_no_ca.pdb is an all atoms pdb file and spa_no_ca_back.pdb is a pdb file with only the backbone heavy atoms. Thank you for helping me.

Em sex., 26 de mar. de 2021 às 06:04, Tim Hempel @.***> escreveu:

Well, you can upload your PDB file here and I will take a look.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/markovmodel/PyEMMA/issues/1496#issuecomment-808054396, or unsubscribe https://github.com/notifications/unsubscribe-auth/AS3PBBMR47CIQCTLNX6QFQDTFREZNANCNFSM4ZZGD3KQ .

andresilvapimentel commented 3 years ago

I also tried to use the lisozyme pdb file (1AKI pdb code) from the gromacs tutorial. You can download it by itself from the pdb database (https://www.rcsb.org/structure/1aki).

thempel commented 3 years ago

Email attachments do not get processed by GitHub, so I could only test the 1AKI pdf file downloaded from RCSB. That works, i.e., the featurizer tells me it's 512 dimensions with the code that you pasted above.

import pyemma
feat = pyemma.coordinates.featurizer('/tmp/1aki.pdb')
feat.add_backbone_torsions(cossin=True, periodic=False)
print(feat.dimension())
> 512

andresilvapimentel commented 3 years ago

So... Do you have any idea what is going on about the issue on 1AKI dimensionality in my code? The spa_no_ca.pdb and spa_no_ca_back.pdb files have 1736 dimensions each.

The feature selection (below) was done with 1 dimension only because it does not work with more than 1 dimension:

def score_cv(data, dim, lag, number_of_splits=10, validation_fraction=0.5):
    with pyemma.util.contexts.settings(show_progress_bars=False):
        nval = int(len(data) * validation_fraction)
        scores = np.zeros(number_of_splits)
        for n in range(number_of_splits):
            ival = np.random.choice(len(data), size=nval, replace=False)
            vamp = pyemma.coordinates.vamp(
                [d for i, d in enumerate(data) if i not in ival], lag=lag, dim=dim)
            scores[n] = vamp.score([d for i, d in enumerate(data) if i in ival])
    return scores

dim = 1

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, lag in zip(axes.flat, [5, 10, 20]):
    torsions_scores = score_cv(torsions_data, lag=lag, dim=dim)
    scores = [torsions_scores.mean()]
    errors = [torsions_scores.std()]
    positions_scores = score_cv(positions_data, lag=lag, dim=dim)
    scores += [positions_scores.mean()]
    errors += [positions_scores.std()]
    distances_scores = score_cv(distances_data, lag=lag, dim=dim)
    scores += [distances_scores.mean()]
    errors += [distances_scores.std()]
    ax.bar(labels, scores, yerr=errors, color=['C0', 'C1', 'C2'])
    ax.set_title(r'lag time $\tau$={:.1f}ns'.format(lag * 0.1))
    if lag == 5:
        # save for later
        vamp_bars_plot = dict(
            labels=labels, scores=scores, errors=errors, dim=dim, lag=lag)
axes[0].set_ylabel('VAMP2 score')
fig.tight_layout()

Please, let me know if there is anything wrong with this code.

thempel commented 3 years ago

Maybe data is not a list of arrays but just a single array? Can you print type(data) and len(data)? I would guess that [d for i, d in enumerate(data) if i not in ival] is the root cause of the 1 dimension problem as it's splitting your single trajectory into single frames which are subsequently used as 1D time series.

The function score_cv, if I'm not mistaken, is a code to make a cross validation split in a set of multiple trajectories. If you only have a single trajectory, it will not behave as expected. Cross-validation can be done in several ways and is also possible for single trajectories.

andresilvapimentel commented 3 years ago

Here are the data print(type(torsions_data)) print(len(torsions_data)) print(torsions_data) <class 'numpy.ndarray'> 501 [[ 0.35456756 -0.9350304 -0.9936108 ... -0.574567 -0.8755304 0.48316303] [ 0.5644656 -0.8254566 -0.15634602 ... -0.7281929 -0.74392456 0.6682636 ] [ 0.7822153 -0.62300825 -0.05416011 ... -0.9307273 -0.8302969 0.55732137] ... [-0.0182284 -0.9998338 -0.92581666 ... -0.9603366 -0.6912165 0.7226477 ] [-0.06916294 -0.9976054 -0.99807996 ... -0.99246496 -0.845842 0.5334335 ] [ 0.09104805 -0.9958465 -0.99964976 ... -0.9669378 -0.9741069 0.22608787]]

print(type(positions_data)) print(len(positions_data)) print(positions_data) <class 'numpy.ndarray'> 501 [[5.032 5.6740003 2.4680002 ... 5.4080005 4.2640004 4.8320003] [4.9820004 5.5360003 2.3100002 ... 5.51 4.2200003 4.86 ] [5.322 5.6520004 2.41 ... 5.6340003 4.124 4.9620004] ... [9.51 1.95 5.458 ... 6.596 3.6940002 5.51 ] [9.412001 1.8780001 6.044 ... 6.7860003 3.6640003 5.55 ] [9.616 1.8340001 6.5160003 ... 6.8960004 3.3400002 5.6580005]]

print(type(distances_data)) print(len(distances_data)) print(distances_data) <class 'numpy.ndarray'> 501 [[0.37615922 0.49618506 0.55720013 ... 0.38283676 0.4672515 0.35082757] [0.32674178 0.44612575 0.5509339 ... 0.38586003 0.46542895 0.35145992] [0.31689087 0.44170573 0.54378307 ... 0.38488457 0.4627092 0.34586102] ... [0.362717 0.4819956 0.5591063 ... 0.3957121 0.47624367 0.3355831 ] [0.36416477 0.49104774 0.5538848 ... 0.3760216 0.43495744 0.31988096] [0.37694556 0.50163716 0.5549954 ... 0.37494037 0.45380637 0.33198816]]

andresilvapimentel commented 3 years ago

I will try with two trajectories and let you know what happen...

andresilvapimentel commented 3 years ago

I duplicated the only trajectory I have and upload both trajectories in the notebook. This was the way I came up to cheat the score_cv function. And, it worked it out. So, it must have two trajectories, at least.

Now, I got another error message in the TICA analysis. So, I am going to close this issue and open another one. Thanks!!!

markovmodel / PyEMMA

Dimension in the feature selection #1496