jeiros / Jupyter_notebooks

Collection of notebooks for data analysis/visualization
MIT License

Errors: gmrq-model-selection.ipynb #1

Open AlirezaTafazzol opened 6 years ago

AlirezaTafazzol commented 6 years ago

Hi Juan, I was following your code for GMRQ-based model selection for Markov State Models (gmrq-model-selection.ipynb), and I ran into the same problem you did. I guess it's because my system is large and I do not have enough data for this scoring method. Did you eventually figure out how to solve this issue, or did you move on to something else? I ask because I found in your 2016 paper in the journal "Physical Chemistry Chemical Physics" that you used the DBSCAN algorithm instead.

I would be grateful for any advice on the code in this notebook: does it work, and if not, what are the problems with it? And what would you suggest instead for selecting parameters for MSMs?

Thanks, Ali

jeiros commented 6 years ago

Hi Ali,

Thanks for leaving a comment!

The notebook you mention is quite old, and I no longer use that approach. If you want to do hyperparameter optimization for your MSM analysis pipeline, I recommend Osprey. It's a package that automates the process and integrates nicely with msmbuilder, mdtraj and scikit-learn, and I've found it really easy to use. The problem you mention is pervasive, though: getting sufficient sampling for large systems is really hard. Cross-validation splits the data into training and test sets, so you need A LOT of data for every split to be well sampled. Here is a discussion of this issue that might be of interest.
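To make the data problem concrete, here is a minimal sketch (plain Python, not the notebook's code and not Osprey's API) of splitting trajectories into folds and checking which states show up in the test set but were never visited in the training set. The helper names `kfold_trajectories` and `unseen_test_states` and the toy trajectories are made up for illustration.

```python
from typing import List, Sequence, Set, Tuple


def kfold_trajectories(n_trajs: int, n_folds: int) -> List[Tuple[List[int], List[int]]]:
    """Split trajectory indices into (train, test) pairs, one pair per fold."""
    if not 2 <= n_folds <= n_trajs:
        raise ValueError("need 2 <= n_folds <= n_trajs")
    # Interleaved assignment: trajectory i goes to fold i % n_folds.
    folds = [list(range(i, n_trajs, n_folds)) for i in range(n_folds)]
    splits = []
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, test))
    return splits


def unseen_test_states(
    trajs: Sequence[Sequence[int]], train: Sequence[int], test: Sequence[int]
) -> Set[int]:
    """Return states visited by the test trajectories but by no training trajectory."""
    train_states = {s for i in train for s in trajs[i]}
    test_states = {s for i in test for s in trajs[i]}
    return test_states - train_states


# Toy discretized trajectories (hypothetical data): each entry is a cluster label.
trajs = [
    [0, 0, 1, 1, 0],
    [1, 1, 2, 2, 1],
    [0, 1, 1, 0, 0],
    [2, 2, 3, 3, 2],  # the only trajectory that visits state 3
]

for train, test in kfold_trajectories(len(trajs), n_folds=2):
    missing = sorted(unseen_test_states(trajs, train, test))
    print(f"train={train} test={test} states unseen in training={missing}")
```

With so few trajectories, every fold contains test states the training model has never seen, which is the situation where a cross-validated score becomes unreliable.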

What I've started to do is build MSMs on small regions of my system that are of mechanistic interest and on which it is easier to build converged models. For example, my protein is quite big (419 residues), and once solvated the system has about 300k atoms. It takes me a long time to generate a decent data set (so far I've been running simulations for about 2 years and have gathered around 200 µs). Hence I cannot really build converged MSMs of the complete protein (based on dihedrals or pairwise alpha-carbon distances), but I can build MSMs on certain regions of interest. Basically, you can still build crude MSMs without worrying too much about fine-tuning every parameter in your analysis pipeline. Also have a read of this note, which explains why you cannot optimize the MSM lag time (or the tICA lag time, for example, if you use that as the approximation to the transfer operator).
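As a rough illustration of what a "crude" MSM boils down to, the sketch below counts transitions at a fixed lag time and row-normalizes the counts into a transition matrix. It is a simplification written for this comment: it does not enforce reversibility or trim poorly connected states the way a real MSM package would, and the function names and toy data are hypothetical.

```python
from typing import List, Sequence


def count_matrix(trajs: Sequence[Sequence[int]], n_states: int, lag: int) -> List[List[int]]:
    """Count transitions i -> j observed `lag` frames apart (sliding window)."""
    if lag < 1:
        raise ValueError("lag must be >= 1")
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajs:
        # Pairs are taken within each trajectory only, never across trajectories.
        for a, b in zip(traj[:-lag], traj[lag:]):
            counts[a][b] += 1
    return counts


def transition_matrix(counts: Sequence[Sequence[int]]) -> List[List[float]]:
    """Row-normalize counts; rows with no observed transitions are left as zeros."""
    rows = []
    for row in counts:
        total = sum(row)
        rows.append([c / total for c in row] if total else [0.0] * len(row))
    return rows


# Toy two-state trajectories (hypothetical data).
toy_trajs = [[0, 0, 1, 1, 0], [1, 1, 0, 0, 0]]
C = count_matrix(toy_trajs, n_states=2, lag=1)
T = transition_matrix(C)
print(C)
print(T)
```

The lag time here is a fixed input chosen by hand, in line with the point above that it is not something to optimize by scoring.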

The PCCP paper you mention is some old work that I did at the beginning of my PhD, and at that time I had not really discovered MSMs yet. The clustering that you see was done with cpptraj, one of the programs in the AmberTools suite. In practice, that paper shows that you can do clustering and some basic analysis on your protein and still gain insight from it. It's just not enough to build an MSM that can give you more interesting metrics, like the timescales of your conformational changes, the mean first passage time between a pair of macrostates, etc.

Hope that helps. If you have any other questions, let me know.