markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0

"k" and "stride" in pyemma.coordinates.cluster_kmeans #1404

Closed lunnali closed 5 years ago

lunnali commented 5 years ago

Dear PyEMMA Users

I am new to both MSM and PyEMMA and I am currently doing some parameter testing to get myself familiar.

When I cluster the same data set (after dimension reduction) with pyemma.coordinates.cluster_kmeans(), I find that "stride" has to be smaller for a larger number of clusters k, and vice versa. Can I understand it this way: data taken at larger stride intervals can only support a smaller number of clusters, and the largest usable number of clusters is reached at stride = 1, because then all the available data are used?

Is there a criterion for choosing the best "stride/k" combination? For example, could I use the VAMP2 score from pyemma.msm.estimate_markov_model() and score_cv() to check whether the score saturates as the number of clusters increases?

However, after trying some numbers, I realize that as long as I have chosen an (apparently) reasonable MSM lag time (from the implied-timescale test via pyemma.msm.its), changing "stride/k" does not affect the saturated VAMP2 score much. Is that expected?

Thank you so much for your patience!

Best Wishes

thempel commented 5 years ago

I don't think the stride should be optimized as a model hyperparameter in the way the number of cluster centers, k, is. Of course, striding too aggressively removes too much of your data: if you are left with fewer frames than the cluster centers you want to assign, the clustering will fail. Generally, you don't want your results to depend on the stride. If your results for the maximum-likelihood (ML) MSM change significantly compared to stride=1, you probably should not trust them.
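To make the frame-count limit concrete, here is a minimal sketch in plain Python. It assumes that striding keeps every stride-th frame of each trajectory starting at frame 0 (i.e. `traj[::stride]`). The helper names `frames_after_stride` and `max_feasible_stride` are made up for illustration and are not part of PyEMMA.

```python
def frames_after_stride(traj_lengths, stride):
    """Frames left per trajectory when keeping every `stride`-th frame.

    Assumes the strided view is traj[::stride], which keeps
    ceil(n / stride) frames of a trajectory with n frames.
    """
    return [-(-n // stride) for n in traj_lengths]  # integer ceil division


def max_feasible_stride(traj_lengths, k):
    """Largest stride that still leaves at least k frames in total.

    This is only a lower bound on what k-means needs: with barely more
    frames than centers, the clustering runs but is not meaningful.
    """
    # Total frame count never increases with stride, so search downwards.
    for stride in range(max(traj_lengths), 0, -1):
        if sum(frames_after_stride(traj_lengths, stride)) >= k:
            return stride
    raise ValueError("fewer than k frames even at stride=1")


# Hypothetical example: two trajectories of 1000 and 500 frames, k = 200.
print(frames_after_stride([1000, 500], 10))   # [100, 50]
print(max_feasible_stride([1000, 500], 200))  # 7
```

In practice you would stay well below this bound, so that each cluster center is supported by many frames.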

About your last point: that is difficult to tell without looking at your data. The simplest explanation would be that you have already lost all your processes (possibly because of a very large MSM lag time), so that changing the stride or the number of cluster centers cannot worsen the results. But I guess there are other, data-specific explanations.
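One way to check how much the score depends on k and the stride is to scan both and compare cross-validated VAMP2 scores. The following is only a sketch of such a scan. The function name `vamp2_scan` and the default values for `ks`, `strides`, `n_splits`, `score_k` and `max_iter=50` are placeholders, not recommendations. As far as I know, `stride` in `cluster_kmeans` only affects which frames are used to estimate the cluster centers, while `.dtrajs` assigns all frames.

```python
import numpy as np


def vamp2_scan(data, lag, ks=(50, 100, 200), strides=(1, 10),
               n_splits=5, score_k=5):
    """Cross-validated VAMP2 score for every (k, stride) pair.

    data: list of 2D arrays (frames x features), e.g. TICA output.
    lag:  MSM lag time in frames.
    Returns a dict {(k, stride): (mean score, std of score)}.
    """
    import pyemma  # imported lazily; requires PyEMMA to be installed

    results = {}
    for k in ks:
        for stride in strides:
            # Estimate cluster centers from the strided data.
            cl = pyemma.coordinates.cluster_kmeans(
                data, k=k, max_iter=50, stride=stride)
            # Estimate an MSM on the discrete trajectories.
            msm = pyemma.msm.estimate_markov_model(cl.dtrajs, lag)
            # Score the MSM by cross-validation on the same dtrajs.
            scores = msm.score_cv(cl.dtrajs, n=n_splits,
                                  score_method='VAMP2',
                                  score_k=min(score_k, k))
            results[(k, stride)] = (float(np.mean(scores)),
                                    float(np.std(scores)))
    return results
```

If the mean scores for different strides agree within their standard deviations at a fixed k, the stride is probably not affecting the model. Repeating the scan at a shorter lag time would show whether the flat scores come from processes that were already lost at the chosen lag.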