markovmodel / pyemma_tutorials

How to analyze molecular dynamics data with PyEMMA
Creative Commons Attribution 4.0 International

Validation scoring using shuffled subsamples #134

Closed hsidky closed 6 years ago

hsidky commented 6 years ago

Hi,

The cross validation performed in the 00-pentapeptide-showcase.ipynb notebook under feature selection appears to be incorrect. In the score_cv function, the feature trajectories are subsampled randomly without replacement using np.random.choice. This produces a non-sequential (shuffled) and unevenly spaced trajectory. How can a lag time be specified for VAMP scoring when the data are shuffled?

Also, as currently implemented, score_cv does not perform "cross validation" proper (it is not explicitly stated in the notebook, but the function name implies it), which splits the data into k folds, each used exactly once as the validation set while the remaining folds form the training set. In that scheme there is no need to specify both the number of folds k and a validation fraction: if k = 5, for example, the validation fraction is by definition 0.2. It would also be performed without shuffling, for the reasons stated above.

Alternatively, if there are n*k independent trajectories, each fold can consist of a disjoint subset of whole trajectories.
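To make the proposal concrete, here is a minimal sketch of trajectory-level k-fold splitting as described above. kfold_trajectory_splits is a hypothetical helper for illustration, not part of PyEMMA or the notebook; it only assigns trajectory indices to folds, leaving the frames within each trajectory sequential so lagged pairs remain valid.

```python
import numpy as np

def kfold_trajectory_splits(n_trajs, k, seed=0):
    """Yield (train, validation) trajectory-index arrays for k-fold CV.

    Each of the k folds serves exactly once as the validation set;
    the remaining trajectories form the training set. Only whole
    trajectories are shuffled, never individual frames.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_trajs)   # shuffle trajectory indices once
    folds = np.array_split(order, k)   # k (nearly) equal folds
    for i in range(k):
        validation = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, validation
```

With k = 5 folds the validation fraction is automatically 1/5 of the trajectories, which is why specifying both k and a separate validation fraction would be redundant.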

If I am mistaken and it is possible to subsample in the manner currently implemented in the notebook, I would definitely appreciate an explanation. If not, then I would be happy to make the correction and submit a pull request, or leave it up to you.

Thanks for the excellent work and contribution!

Hythem

cwehmeyer commented 6 years ago

Hi @hsidky, thanks for raising this issue!

In the score_cv function, the feature trajectories are subsampled randomly without replacement using np.random.choice. This produces a non-sequential (shuffled) and unevenly-spaced trajectory.

Considering frame-wise subsampling, you are correct. Here, however, we are subsampling entire (independent) trajectories (as you have mentioned as an alternative for k-fold cross validation).

Also, as currently implemented, score_cv does not perform "cross validation" proper (though not explicitly stated in the notebook it's implied by the function name), which splits the data into k folds, each one being used once as validation and the remaining times as part of the training set.

Yes, we do not use a "proper" k-fold cross validation. Instead, we randomly split the 25 independent trajectories into training and test sets and repeat this process with independent random draws.
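The repeated random splitting described above (a shuffle-split scheme at the trajectory level) can be sketched as follows. shuffle_split_scores is a hypothetical helper, not the notebook's actual score_cv, and score_fn stands in for the real estimator-and-scorer (e.g. fitting a VAMP model on the training trajectories and scoring it on the validation trajectories):

```python
import numpy as np

def shuffle_split_scores(trajs, score_fn, n_splits=10,
                         validation_fraction=0.5, seed=0):
    """Repeatedly split a list of independent trajectories into
    training and validation sets and score each split.

    np.random.choice without replacement is applied to *trajectory
    indices*, so frames within each trajectory stay sequential and
    lagged pairs remain well defined.
    """
    rng = np.random.default_rng(seed)
    n_val = int(len(trajs) * validation_fraction)
    scores = np.empty(n_splits)
    for n in range(n_splits):
        # draw whole trajectories, not frames
        val_idx = set(rng.choice(len(trajs), size=n_val,
                                 replace=False).tolist())
        train = [t for i, t in enumerate(trajs) if i not in val_idx]
        val = [t for i, t in enumerate(trajs) if i in val_idx]
        scores[n] = score_fn(train, val)
    return scores
```

Unlike k-fold cross validation, the splits here are independent draws, so a given trajectory may land in the validation set several times (or never); this is why both the number of splits and the validation fraction appear as free parameters.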

If I am mistaken and it is possible to subsample in the manner currently implemented in the notebook, I would definitely appreciate an explanation.

I agree, this approach needs a more detailed explanation.

Thanks again for pointing this out and your offer to contribute 👍