Closed hsidky closed 6 years ago
Hi @hsidky, thanks for raising this issue!
> In the score_cv function, the feature trajectories are subsampled randomly without replacement using np.random.choice. This produces a non-sequential (shuffled) and unevenly-spaced trajectory.

For frame-wise subsampling, you would be correct. Here, however, we are subsampling entire (independent) trajectories, which you mention as an alternative for k-fold cross validation. The frames within each trajectory keep their original order and spacing, so the lag time remains well defined.

> Also, as currently implemented, score_cv does not perform "cross validation" proper (though not explicitly stated in the notebook it's implied by the function name), which splits the data into k folds, each one being used once as validation and the remaining times as part of the training set.

Yes, we do not use a "proper" k-fold cross validation. Instead, we randomly split the 25 independent trajectories into training and test sets and repeat this process.

> If I am mistaken and it is possible to subsample in the manner currently implemented in the notebook, I would definitely appreciate an explanation.

I agree, this approach needs a more detailed explanation.
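
As a rough illustration of the splitting scheme described above, here is a minimal NumPy-only sketch. It is not the notebook's score_cv code: the helper name random_trajectory_splits, the toy data, and the default validation fraction are assumptions made for this example. The point is that np.random-style sampling is applied to trajectory indices, never to frames.

```python
import numpy as np


def random_trajectory_splits(n_trajs, validation_fraction=0.5, n_splits=10, seed=0):
    """Yield (train_idx, val_idx) arrays that index whole trajectories.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    n_val = int(n_trajs * validation_fraction)
    all_idx = np.arange(n_trajs)
    for _ in range(n_splits):
        # Sample trajectory indices without replacement (not frame indices).
        val_idx = np.sort(rng.choice(n_trajs, size=n_val, replace=False))
        train_idx = np.setdiff1d(all_idx, val_idx)
        yield train_idx, val_idx


# Toy stand-in for 25 independent feature trajectories of 100 frames each.
toy_trajs = [np.arange(100) + 1000 * i for i in range(25)]

splits = list(random_trajectory_splits(len(toy_trajs), n_splits=3))
for train_idx, val_idx in splits:
    train = [toy_trajs[i] for i in train_idx]
    val = [toy_trajs[i] for i in val_idx]
    # Frames inside every trajectory are still sequential and evenly spaced.
    assert all(np.all(np.diff(t) == 1) for t in train + val)
```

Each repetition draws a new random partition, so a given trajectory may land in the test set several times or not at all. That is the difference from k-fold cross validation.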
Thanks again for pointing this out and your offer to contribute 👍
Hi,
The cross validation performed in the 00-pentapeptide-showcase.ipynb notebook under feature selection appears to be incorrect. In the score_cv function, the feature trajectories are subsampled randomly without replacement using np.random.choice. This produces a non-sequential (shuffled) and unevenly-spaced trajectory. How can a lag-time be specified for VAMP scoring when the data are shuffled?
Also, as currently implemented, score_cv does not perform "cross validation" proper (though not explicitly stated in the notebook it's implied by the function name), which splits the data into k folds, each one being used once as validation and the remaining times as part of the training set. In this sense there is no need to define number of folds k and validation fraction, since if k = 5 for example, the validation fraction is by definition 0.2. It would also be performed without shuffling for the reasons stated above.
Alternatively, if there are some n*k independent trajectories, then each fold can consist of a subset of the trajectories.
If I am mistaken and it is possible to subsample in the manner currently implemented in the notebook, I would definitely appreciate an explanation. If not, then I would be happy to make the correction and submit a pull request, or leave it up to you.
Thanks for the excellent work and contribution!
Hythem