Adaptive Sampling and Structure Extraction from TICA or PCA Plot

markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂

http://pyemma.org

GNU Lesser General Public License v3.0


Adaptive Sampling and Structure Extraction from TICA or PCA Plot #1533

Closed: hl2500 closed this issue 2 years ago

hl2500 commented 2 years ago

Hello,

Are there any tutorials for adaptive sampling with PyEMMA? I was building a model, but the fraction of states used is only 0.22. Is this because of poor sampling, with some of the states disconnected? How can I tell which trajectory frames I should extract to start new simulations?

Also, how can I extract the most probable structure from different states on TICA or PCA plots (e.g. IC1 vs. IC2) without building an MSM?

Thank you!

clonker commented 2 years ago

Hi,

as far as I know, there are no notebooks specifically dealing with adaptive sampling. Generally, a low fraction of states used means that regions of the state space are disconnected and your sampling isn't good enough. You could define a reaction coordinate that steers the adaptive sampling process. Based on that, you can then also pick frames to start new simulations.
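To make the frame-picking idea concrete, here is a minimal NumPy-only sketch. It is not a PyEMMA feature: it assumes you already have one 1D array of reaction-coordinate values per trajectory, and it uses a simple "seed from the least-populated bins" heuristic, which is only one of several possible selection rules. The function name `pick_restart_frames` and its parameters are made up for illustration.

```python
import numpy as np

def pick_restart_frames(rc_trajs, n_bins=20, n_seeds=5, rng=None):
    """Pick (trajectory index, frame index) pairs from the least-sampled
    visited bins of a 1D reaction coordinate (hypothetical helper)."""
    rng = np.random.default_rng(rng)
    lengths = [len(t) for t in rc_trajs]
    values = np.concatenate(rc_trajs)

    # Remember which trajectory and frame each value came from.
    traj_idx = np.repeat(np.arange(len(rc_trajs)), lengths)
    frame_idx = np.concatenate([np.arange(n) for n in lengths])

    # Bin the coordinate; using inner edges only gives labels 0..n_bins-1.
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    bins = np.digitize(values, edges[1:-1])
    counts = np.bincount(bins, minlength=n_bins)

    # Visited bins, ordered from least to most populated.
    visited = np.flatnonzero(counts)
    rare_bins = visited[np.argsort(counts[visited], kind="stable")][:n_seeds]

    # Draw one random frame from each rarely visited bin.
    seeds = []
    for b in rare_bins:
        choice = rng.choice(np.flatnonzero(bins == b))
        seeds.append((traj_idx[choice], frame_idx[choice]))
    return np.array(seeds, dtype=int)

# Toy data: two short "trajectories" of a reaction coordinate.
rc_trajs = [np.array([0.0, 0.1, 0.2, 0.1, 0.0, 0.1]),
            np.array([0.2, 0.1, 1.0])]
seeds = pick_restart_frames(rc_trajs, n_bins=2, n_seeds=1, rng=0)
print(seeds)
```

The result is an array of (trajectory index, frame index) pairs, which is the index layout that `pyemma.coordinates.save_traj` accepts, so the selected frames could be written out as starting structures from there.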

Regarding the projections: I do not recommend using PCA for this. With TICA, you can have a look at the free energy surface, which should give you some clues about the states the system likes to be in (in projected space). It is quite possible, though, that two components are not enough to adequately describe the energy wells and their proximity to one another.
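As a rough illustration of what "free energy surface" means here, the sketch below estimates F = -ln p (in units of kT) from a 2D histogram of two projected coordinates. It uses only NumPy and synthetic input; the helper name `free_energy_2d` is hypothetical, and the estimate is purely count-based with no MSM reweighting. PyEMMA also provides `pyemma.plots.plot_free_energy` for plotting this kind of surface directly.

```python
import numpy as np

def free_energy_2d(x, y, bins=50):
    """Histogram-based free energy (units of kT) on a 2D grid.

    Unvisited bins are set to +inf; the global minimum is shifted to 0.
    """
    hist, xedges, yedges = np.histogram2d(x, y, bins=bins)
    prob = hist / hist.sum()

    free_energy = np.full(prob.shape, np.inf)
    visited = prob > 0
    free_energy[visited] = -np.log(prob[visited])
    free_energy -= free_energy[visited].min()
    return free_energy, xedges, yedges

# Synthetic stand-in for the first two TICA components (IC1, IC2).
rng = np.random.default_rng(0)
ic1 = np.concatenate([rng.normal(-1.0, 0.3, 5000), rng.normal(1.5, 0.3, 2000)])
ic2 = np.concatenate([rng.normal(0.0, 0.3, 5000), rng.normal(1.0, 0.3, 2000)])

F, xedges, yedges = free_energy_2d(ic1, ic2, bins=40)

# Locate the deepest well as the centre of the lowest free energy bin.
i, j = np.unravel_index(np.argmin(F), F.shape)
well = (0.5 * (xedges[i] + xedges[i + 1]), 0.5 * (yedges[j] + yedges[j + 1]))
print("deepest well near IC1, IC2 =", well)
```

The wells found this way only reflect how often each region was observed in the two chosen components, so states that overlap in this projection cannot be separated by it.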

Best, Moritz

thempel commented 2 years ago

For extracting the most probable structure (or, more precisely, the one that you've observed most frequently), you can do a histogram analysis in your transformed space. In a 2D space that could be done with numpy, but in general you can run a clustering with e.g. k-means and count the number of occurrences of each state with e.g. np.bincount(np.concatenate(cluster.dtrajs)). From these histogram counts, you can select the state with the highest count and draw frames from your simulation that were assigned to this state (e.g. with cluster.sample_indexes_by_cluster). You can then use pyemma.coordinates.save_traj to write out the frames.
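The counting step can be sketched with NumPy alone. The example below assumes you already have discrete trajectories (a list of integer arrays, which is what `cluster.dtrajs` provides after clustering) and uses a small made-up list in their place. The helper `most_populated_state_frames` is hypothetical; within PyEMMA, `cluster.sample_indexes_by_cluster` fills the same role of drawing frames from a chosen state.

```python
import numpy as np

def most_populated_state_frames(dtrajs, n_frames=10, rng=None):
    """Find the most frequently visited state and draw frames from it.

    Returns the state index, the per-state counts, and an array of
    (trajectory index, frame index) pairs.
    """
    rng = np.random.default_rng(rng)

    # Histogram over all discrete trajectories, as suggested in the text.
    counts = np.bincount(np.concatenate(dtrajs))
    top_state = int(np.argmax(counts))

    # Collect every frame that was assigned to the top state.
    pairs = np.array(
        [(i, t) for i, d in enumerate(dtrajs)
         for t in np.flatnonzero(d == top_state)],
        dtype=int,
    )

    # Draw up to n_frames of them without replacement.
    n_pick = min(n_frames, len(pairs))
    picked = pairs[rng.choice(len(pairs), size=n_pick, replace=False)]
    return top_state, counts, picked

# Made-up discrete trajectories standing in for cluster.dtrajs.
dtrajs = [np.array([0, 1, 1, 2]), np.array([1, 0, 1])]
state, counts, frames = most_populated_state_frames(dtrajs, n_frames=2, rng=0)
print(state, counts, frames)
```

The returned (trajectory index, frame index) pairs follow the index layout that `pyemma.coordinates.save_traj` expects, so they could be passed on to write the selected structures to disk.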

Be aware that the states that you get here are purely based on your observation data and may be biased by limited sampling.

hl2500 commented 2 years ago

Thank you for the suggestions!