markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0

Dimension reduction #1469

Closed ghost closed 4 years ago

ghost commented 4 years ago

Hello,

I am trying to build an MSM for a 17-residue RNA system, using eRMSD and G-vectors as my features. Upon loading features, I get 1157 dimensions, and when I transform my data using TICA, I get 116 dimensions. I have considered moving on with only the first two ICs, but I am worried because they do not account for much of the cumulative variance.
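For context, this is roughly how I got those numbers. The array shapes and the lag time below are placeholders standing in for my actual eRMSD / G-vector features, not my real setup:

```python
import numpy as np
import pyemma

# Placeholder feature trajectories standing in for the eRMSD / G-vector
# features (in my case: several arrays of shape (n_frames, 1157)).
feature_trajs = [np.random.randn(2000, 1157) for _ in range(3)]

# TICA at an example lag time; with the default settings PyEMMA keeps
# ICs up to a 95% kinetic variance cutoff, which is where the 116
# dimensions come from.
tica = pyemma.coordinates.tica(feature_trajs, lag=100)

print(tica.dimension())   # number of ICs kept
print(tica.cumvar[:5])    # cumulative kinetic variance of the first ICs
tica_output = tica.get_output()
```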

[Screenshot: cumulative variance of the TICA components]

Is it practical to move on with all the TICA dimensions I got? If not, what can I do to further reduce dimensionality?

Thank you,

Tia

thempel commented 4 years ago

The point of reducing dimensionality in the context of MSM estimation is to map down to a dimensionality that you can discretize without too much discretization error. The higher the dimensionality, the more difficult it is to discretize. What number of dimensions is feasible depends a lot on the dataset; it is certainly > 2, but I'd also stay below, say, 100. You can have a look at the marginal distributions using pyemma.plots.plot_feature_histograms(); often the higher TICA dimensions become a bit noisy. The question of how many dimensions to keep is not at all trivial, though: you need to make sure you are not discarding important or interesting processes, which requires some understanding of what the TICs mean in terms of your data. Maybe our tutorials are also interesting for you (in notebook 2 we explain dimension reduction).
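Just as a rough sketch of what I mean (the lag time, variance cutoff and cluster number below are example values, not recommendations, and the random data only stands in for your feature trajectories): let TICA choose the number of ICs via a kinetic variance cutoff, look at the marginals, then discretize the reduced space.

```python
import numpy as np
import matplotlib.pyplot as plt
import pyemma

# Stand-in for the feature trajectories from the original post
# (a list of (n_frames, n_features) arrays).
feature_trajs = [np.random.randn(2000, 50) for _ in range(3)]

# Let TICA pick the number of ICs via a kinetic variance cutoff.
tica = pyemma.coordinates.tica(feature_trajs, lag=100, var_cutoff=0.95)
tica_output = tica.get_output()

# Marginal distributions of the retained ICs; higher ICs often look
# increasingly noisy and unstructured.
fig, ax = plt.subplots(figsize=(5, 8))
pyemma.plots.plot_feature_histograms(
    np.concatenate(tica_output),
    feature_labels=['IC {}'.format(i + 1) for i in range(tica.dimension())],
    ax=ax)
fig.tight_layout()

# Discretize the reduced space, e.g. with k-means, before MSM estimation.
clustering = pyemma.coordinates.cluster_kmeans(tica_output, k=200, max_iter=50)
dtrajs = clustering.dtrajs
```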

ghost commented 4 years ago

Thank you.