msmbuilder / msmbuilder-legacy

Legacy release of MSMBuilder
http://msmbuilder.org
GNU General Public License v2.0

Question: Criteria for selecting an 'appropriate' number of microstates #363

Open s-gordon opened 10 years ago

s-gordon commented 10 years ago

Hi all.

My understanding is that when making a fine-grained MSM, the intention is to make as many states as possible to gain a sufficiently detailed kinetic landscape, yet at the same time to avoid making TOO MANY states, given that this can create poorly connected states that fudge the statistics.

With this in mind, I'm aiming to decompose a large number of conformations from simulation into microstates by applying a certain cluster radius cutoff (e.g. a few Angstroms). If I apply a cutoff below a certain range (e.g. ≤ 2 Å), the 'quality' of the plot of relaxation timescales against lag time deteriorates considerably (i.e. loss of hierarchy, structure, etc.). Above this cutoff, the hierarchy of timescales remains fairly consistent, although the number of microstates obviously tapers off quite rapidly.
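For concreteness, here is a minimal self-contained sketch of the kind of workflow I mean (plain NumPy, with Euclidean distance standing in for an RMSD metric and a simple leader-style clustering, rather than the actual msmbuilder API or my real setup): cluster with a radius cutoff, count transitions at several lag times, and compute implied timescales as τ_i = -τ / ln λ_i.

```python
import numpy as np

def leader_cluster(X, cutoff):
    """Assign each frame to the first existing center within `cutoff`
    (Euclidean distance stands in for an RMSD metric here); open a new
    cluster when no center is close enough."""
    centers, labels = [], np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        if centers:
            d = np.linalg.norm(np.asarray(centers) - x, axis=1)
            j = d.argmin()
            if d[j] <= cutoff:
                labels[i] = j
                continue
        centers.append(x)
        labels[i] = len(centers) - 1
    return np.asarray(centers), labels

def implied_timescales(labels, lag, n_timescales=5):
    """Symmetrized transition counts at the given lag, row-normalized,
    then tau_i = -lag / ln(lambda_i) for the slowest non-stationary modes."""
    n = labels.max() + 1
    C = np.zeros((n, n))
    np.add.at(C, (labels[:-lag], labels[lag:]), 1.0)
    C += C.T                                   # crude detailed-balance symmetrization
    T = C / C.sum(axis=1, keepdims=True)
    evals = np.sort(np.linalg.eigvals(T).real)[::-1]
    evals = evals[1:n_timescales + 1]          # drop the stationary eigenvalue (= 1)
    return -lag / np.log(np.clip(evals, 1e-12, 1 - 1e-12))

# Toy 2D "trajectory"; the idea is to repeat this over several cutoffs and
# lag times and see where the slow timescales stop being flat versus lag.
rng = np.random.default_rng(0)
X = np.cumsum(rng.normal(size=(5000, 2)), axis=0) * 0.1
centers, labels = leader_cluster(X, cutoff=2.0)
for lag in (1, 5, 10, 25, 50):
    print(lag, implied_timescales(labels, lag, n_timescales=3))
```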

My question is: is what I have described above an appropriate criterion for choosing the largest number of microstates that I can safely partition my model into?

rmcgibbo commented 10 years ago

> My understanding is that when making a fine-grained MSM, the intention is to make as many states as possible to gain a sufficiently detailed kinetic landscape, yet at the same time to avoid making TOO MANY states, given that this can create poorly connected states that fudge the statistics.

Yes. This is basically correct. There's a bias-variance tradeoff at play.

> With this in mind, I'm aiming to decompose a large number of conformations from simulation into microstates by applying a certain cluster radius cutoff (e.g. a few Angstroms). If I apply a cutoff below a certain range (e.g. ≤ 2 Å), the 'quality' of the plot of relaxation timescales against lag time deteriorates considerably (i.e. loss of hierarchy, structure, etc.). Above this cutoff, the hierarchy of timescales remains fairly consistent, although the number of microstates obviously tapers off quite rapidly.

My interpretation of this is basically: "Use as many states as possible without the timescales going super-wacko (which is interpreted as a signal of high statistical uncertainty in the estimators)". I think it's a pretty reasonable criterion. Unfortunately, the discrete-state MSM formalism doesn't admit much analytic theory on the bias/variance tradeoff, nor any algorithmic approach to balancing the two sources of error that is well supported from a theoretical perspective.
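One way to make that "signal of statistical uncertainty" concrete (a generic illustration, not something built into this workflow) is to bootstrap over the discretized trajectories and look at the spread of the slowest implied timescales: if the spread blows up as you add states or increase the lag time, that is the super-wacko regime quantified.

```python
import numpy as np

def implied_timescales_from_counts(C, lag, n_timescales):
    """tau_i = -lag / ln(lambda_i) from a (symmetrized) count matrix."""
    C = C + C.T
    keep = C.sum(axis=1) > 0                      # drop states never visited in this resample
    Ck = C[np.ix_(keep, keep)]
    T = Ck / Ck.sum(axis=1, keepdims=True)
    ev = np.sort(np.linalg.eigvals(T).real)[::-1][1:n_timescales + 1]
    return -lag / np.log(np.clip(ev, 1e-12, 1 - 1e-12))

def bootstrap_timescales(discrete_trajs, lag, n_timescales=3, n_boot=200, seed=0):
    """Resample whole discretized trajectories with replacement, rebuild the
    count matrix, and recompute the slowest implied timescales each time."""
    rng = np.random.default_rng(seed)
    n_states = max(t.max() for t in discrete_trajs) + 1
    out = []
    for _ in range(n_boot):
        C = np.zeros((n_states, n_states))
        for i in rng.integers(0, len(discrete_trajs), size=len(discrete_trajs)):
            t = discrete_trajs[i]
            np.add.at(C, (t[:-lag], t[lag:]), 1.0)
        ts = implied_timescales_from_counts(C, lag, n_timescales)
        if len(ts) == n_timescales:               # skip degenerate resamples
            out.append(ts)
    out = np.asarray(out)
    return out.mean(axis=0), out.std(axis=0)      # wide std => poorly determined timescales
```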