markovmodel / pyemma_tutorials

How to analyze molecular dynamics data with PyEMMA

How to treat large/complex systems #138

Closed cwehmeyer closed 5 years ago

cwehmeyer commented 5 years ago

We need to address how dealing with large/complex systems differs from the tutorial cases, e.g., using source instead of load, convergence issues, etc.
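For reference, a minimal sketch of the two entry points (file and topology names are placeholders, and the featurization is just an example):

```python
import pyemma

files = ['traj-1.xtc', 'traj-2.xtc']              # placeholder trajectory files
feat = pyemma.coordinates.featurizer('top.pdb')   # placeholder topology
feat.add_backbone_torsions()

# small systems: load all featurized data into memory at once
data = pyemma.coordinates.load(files, features=feat)

# large/complex systems: create a streaming reader that processes chunks on demand
reader = pyemma.coordinates.source(files, features=feat)
```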

This should go into the manuscript, but there are also notebooks where such explanations might be a good fit.

thempel commented 5 years ago

In the tutorial, we have already mentioned using source() instead of load(). So I suggest the following approach: a) we add some citations about complex systems to the manuscript and mention that there are differences; b) we add an example that explains what happens if we operate at the edge of poor sampling, i.e., partially converged ITS and a CK-test that breaks down after a certain number of lag times. I'm already trying to compile an example for #140. That could go into NB08.
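As a rough sketch of what such an example could look like (assuming `dtrajs` holds discrete trajectories from a previous clustering step; lag times and the number of metastable sets are arbitrary choices):

```python
import pyemma

# implied timescales over a range of lag times, with Bayesian error bars
its = pyemma.msm.its(dtrajs, lags=[1, 2, 5, 10, 20, 50], errors='bayes')
pyemma.plots.plot_implied_timescales(its)

# estimate an MSM at a candidate lag time and run a Chapman-Kolmogorov test
msm = pyemma.msm.estimate_markov_model(dtrajs, lag=10)
cktest = msm.cktest(4)   # 4 metastable sets, chosen for illustration
pyemma.plots.plot_cktest(cktest)
```

With poorly sampled data, the ITS curves may fail to level off and the CK-test predictions may deviate from the estimates beyond some lag time, which is exactly the behavior the example would demonstrate.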

thempel commented 5 years ago

c) add a paragraph on the importance of dimension reduction before clustering and the implications of density distributions for k-means in NB02; d) discuss ITS convergence in more detail in the NB03 / di-ala section.
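For c), a possible snippet reducing the dimension with TICA before k-means (the `reader` object, lag time, dimension, number of centers, and stride are placeholder choices):

```python
import pyemma

# project the featurized data onto a few slow collective coordinates first
tica = pyemma.coordinates.tica(reader, lag=10, dim=4)
tica_output = tica.get_output()

# cluster in the low-dimensional TICA space; a stride keeps k-means tractable
cluster = pyemma.coordinates.cluster_kmeans(tica_output, k=200, max_iter=50, stride=10)
dtrajs = cluster.dtrajs
```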

cwehmeyer commented 5 years ago

That sounds very reasonable!

thempel commented 5 years ago

One point that I mentioned in the notebooks as well as in the manuscript concerning large systems is that clustering becomes difficult in high-dimensional spaces. My question is whether we need a citation for this or whether we can just claim it as part of our daily experience.

If I understand this correctly, a paper that seems to fit this purpose would be Aggarwal et al., 2001, "On the Surprising Behavior of Distance Metrics in High Dimensional Space"; a quick numerical illustration of the effect is sketched below.

@cwehmeyer @brookehus
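The illustration referenced above: with uniformly random data, the relative contrast between the largest and smallest pairwise distances shrinks as the dimensionality grows (numbers of points and dimensions are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
for dim in (2, 10, 100, 1000):
    x = rng.random((500, dim))   # 500 uniformly distributed random points
    d = pdist(x)                 # all pairwise Euclidean distances
    # relative contrast (max - min) / min shrinks as dimension grows
    print(f'dim={dim:5d}  contrast={(d.max() - d.min()) / d.min():.2f}')
```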

brookehus commented 5 years ago

I don't know if we need a citation - my comments were more to the point that the mentions of clustering becoming difficult in high dimensions don't make it clear why that is the case, i.e., is it because it's computationally difficult, or because it's computationally fine but we are less confident in the model?

In one of my papers I showed that models consistently achieve better VAC/GMRQ scores for lower-dimensional spaces; see here, sec. V C.
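In PyEMMA terms, such a comparison could be sketched with cross-validated VAMP-2 scoring (closely related to the GMRQ); `dtrajs` is assumed to come from a clustering step, and the lag time, number of folds, and `score_k` are placeholder choices:

```python
import pyemma

# score a candidate discretization by cross-validated VAMP-2
msm = pyemma.msm.estimate_markov_model(dtrajs, lag=10)
scores = msm.score_cv(dtrajs, n=5, score_method='VAMP2', score_k=5)
print(scores.mean(), scores.std())
```

Repeating this for discretizations built in spaces of different dimensionality would make the trend explicit.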