Closed andresilvapimentel closed 3 years ago
If your data is only one-dimensional then you can't really project it into 10 dimensions, this is basically what the error message says. That doesnt have to do with the number of trajectories but rather with the molecule you are working with. If it only has one torsion angle, then that is what you get by using only torsions in the featurizer.
This is a large protein that I generate the xtc file with only the backbone chain. I removed water, cations, and the rest of the protein leaving only the backbone. Is it ok to do that? After running with dimension equal 1, it worked it out and I found that the VAMP2 score of the backbone atoms distances is larger than 1.5 for lag time 0.5 ns and the other features returned VAMP2 score equal to 1, which corresponds to the invariant measure or equilibrium. However, the VAMP-2 score versus numer of dimentions plot has not a curve choosing only one dimension (dim = 1). This does not make sense, right? Something is wrong in the trajectory or topology, right?
I also tried the xtc file with the all atom configuration instead of trying the xtc file with only the backbone configuration. Doing this, pyemma only worked with dimension equal 1 (dim = 1). Do you know what is going on? If you want to check the xtc and pdb files, I can share with you in a google drive. Thank you for helping me.
If you have a large protein and you only get one single torsion angle, there could be something broken with the topology (i.e., the PDB file). Does it know the right atom types? How are you setting up the featurizer and what is the output of featurizer.describe()
?
I set up like this:
torsions_feat = pyemma.coordinates.featurizer('spa_no_ca.pdb') torsions_feat.add_backbone_torsions(cossin=True, periodic=False) torsions_data = pyemma.coordinates.load(files, features=torsions_feat) labels = ['backbone\ntorsions']
positions_feat = pyemma.coordinates.featurizer('spa_no_ca.pdb') positions_feat.add_selection(positions_feat.select_Backbone()) positions_data = pyemma.coordinates.load(files, features=positions_feat) labels += ['backbone atom\npositions']
distances_feat = pyemma.coordinates.featurizer('spa_no_ca.pdb') distances_feat.add_distances( distances_feat.pairs(distances_feat.select_Backbone(), excluded_neighbors=2), periodic=False) distances_data = pyemma.coordinates.load(files, features=distances_feat) labels += ['backbone atom\ndistances']
We tested pyemma using different pdb and xtc file, e.g., the lisozyme.pdb and lisozyme.xtc from the gromacs tutorial. We got the same error message about the dimension. And, it only worked it out using dim = 1. It seems we are doing something wrong generating the lisozyme.pdb and lisozyme.xtc files, I think. Could you explain me how to generate these files, please?
Well, you can upload your PDB file here and I will take a look. [Edit: Preferably the one that you want to use in the end.]
Here are two pdb files that I have been using in pyemma. spa_no_ca.pdb is an all atoms pdb file and spa_no_ca_back.pdb is a pdb file with only the backbone heavy atoms. Thank you for helping me.
Em sex., 26 de mar. de 2021 às 06:04, Tim Hempel @.***> escreveu:
Well, you can upload your PDB file here and I will take a look.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/markovmodel/PyEMMA/issues/1496#issuecomment-808054396, or unsubscribe https://github.com/notifications/unsubscribe-auth/AS3PBBMR47CIQCTLNX6QFQDTFREZNANCNFSM4ZZGD3KQ .
I also tried to use the lisozyme pdb file (1AKI pdb code) from the gromacs tutorial. You can download it by itself from the pdb database (https://www.rcsb.org/structure/1aki).
Email attachments do not get processed by GitHub, so I could only test the 1AKI pdf file downloaded from RCSB. That works, i.e., the featurizer tells me it's 512 dimensions with the code that you pasted above.
import pyemma
feat = pyemma.coordinates.featurizer('/tmp/1aki.pdb')
feat.add_backbone_torsions(cossin=True, periodic=False)
print(feat.dimension())
> 512
So... Do you have any idea what is going on about the issue on 1AKI dimensionality in my code? The spa_no_ca.pdb and spa_no_ca_back.pdb files have 1736 dimensions each.
The feature selection (below) was done with 1 dimension only because it does not work with more than 1 dimension:
def score_cv(data, dim, lag, number_of_splits=10, validation_fraction=0.5):
with pyemma.util.contexts.settings(show_progress_bars=False):
nval = int(len(data) * validation_fraction)
scores = np.zeros(number_of_splits)
for n in range(number_of_splits):
ival = np.random.choice(len(data), size=nval, replace=False)
vamp = pyemma.coordinates.vamp(
[d for i, d in enumerate(data) if i not in ival], lag=lag, dim=dim)
scores[n] = vamp.score([d for i, d in enumerate(data) if i in ival])
return scores
dim = 1
fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, lag in zip(axes.flat, [5, 10, 20]):
torsions_scores = score_cv(torsions_data, lag=lag, dim=dim)
scores = [torsions_scores.mean()]
errors = [torsions_scores.std()]
positions_scores = score_cv(positions_data, lag=lag, dim=dim)
scores += [positions_scores.mean()]
errors += [positions_scores.std()]
distances_scores = score_cv(distances_data, lag=lag, dim=dim)
scores += [distances_scores.mean()]
errors += [distances_scores.std()]
ax.bar(labels, scores, yerr=errors, color=['C0', 'C1', 'C2'])
ax.set_title(r'lag time $\tau$={:.1f}ns'.format(lag * 0.1))
if lag == 5:
# save for later
vamp_bars_plot = dict(
labels=labels, scores=scores, errors=errors, dim=dim, lag=lag)
axes[0].set_ylabel('VAMP2 score')
fig.tight_layout()
Please, let me know if there is anything wrong with this code.
Maybe data
is not a list of arrays but just a single array? Can you print type(data)
and len(data)
? I would guess that [d for i, d in enumerate(data) if i not in ival]
is the root cause of the 1 dimension problem as it's splitting your single trajectory into single frames which are subsequently used as 1D time series.
The function score_cv
, if I'm not mistaken, is a code to make a cross validation split in a set of multiple trajectories. If you only have a single trajectory, it will not behave as expected. Cross-validation can be done in several ways and is also possible for single trajectories.
Here are the data print(type(torsions_data)) print(len(torsions_data)) print(torsions_data) <class 'numpy.ndarray'> 501 [[ 0.35456756 -0.9350304 -0.9936108 ... -0.574567 -0.8755304 0.48316303] [ 0.5644656 -0.8254566 -0.15634602 ... -0.7281929 -0.74392456 0.6682636 ] [ 0.7822153 -0.62300825 -0.05416011 ... -0.9307273 -0.8302969 0.55732137] ... [-0.0182284 -0.9998338 -0.92581666 ... -0.9603366 -0.6912165 0.7226477 ] [-0.06916294 -0.9976054 -0.99807996 ... -0.99246496 -0.845842 0.5334335 ] [ 0.09104805 -0.9958465 -0.99964976 ... -0.9669378 -0.9741069 0.22608787]]
print(type(positions_data)) print(len(positions_data)) print(positions_data) <class 'numpy.ndarray'> 501 [[5.032 5.6740003 2.4680002 ... 5.4080005 4.2640004 4.8320003] [4.9820004 5.5360003 2.3100002 ... 5.51 4.2200003 4.86 ] [5.322 5.6520004 2.41 ... 5.6340003 4.124 4.9620004] ... [9.51 1.95 5.458 ... 6.596 3.6940002 5.51 ] [9.412001 1.8780001 6.044 ... 6.7860003 3.6640003 5.55 ] [9.616 1.8340001 6.5160003 ... 6.8960004 3.3400002 5.6580005]]
print(type(distances_data)) print(len(distances_data)) print(distances_data) <class 'numpy.ndarray'> 501 [[0.37615922 0.49618506 0.55720013 ... 0.38283676 0.4672515 0.35082757] [0.32674178 0.44612575 0.5509339 ... 0.38586003 0.46542895 0.35145992] [0.31689087 0.44170573 0.54378307 ... 0.38488457 0.4627092 0.34586102] ... [0.362717 0.4819956 0.5591063 ... 0.3957121 0.47624367 0.3355831 ] [0.36416477 0.49104774 0.5538848 ... 0.3760216 0.43495744 0.31988096] [0.37694556 0.50163716 0.5549954 ... 0.37494037 0.45380637 0.33198816]]
I will try with two trajectories and let you know what happen...
I duplicated the only trajectory I have and upload both trajectories in the notebook. This was the way I came up to cheat the score_cv function. And, it worked it out. So, it must have two trajectories, at least.
Now, I got another error message in the TICA analysis. So, I am going to close this issue and open another one. Thanks!!!
I loaded only one trajectory, but it was requested more output dimensions (10) (used the default value for dim = 10) than dimension of input data (1) in the feature selection. Do I need to load many trajectories? Pyemma randomly split the list of independent trajectories into a training and a validation set, compute the VAMP2 score, and repeat this process several times. However, what happens if I load only one trajectory? Is this the issue for the following error message?
RuntimeError Traceback (most recent call last)