msmbuilder / msmbuilder-legacy

Legacy release of MSMBuilder
http://msmbuilder.org
GNU General Public License v2.0

Clustering with tICA #428

Closed asgharrazavi closed 10 years ago

asgharrazavi commented 10 years ago

Hi,

I am from Vince Voelz's lab and am trying to use the tICA clustering implemented in the new version of msmbuilder. I noticed two issues:

  1. The eigenvalue problem for the time-lagged correlation and covariance matrices does not converge. I receive the following error: """ numpy.linalg.linalg.LinAlgError: generalized eig algorithm did not converge (info=1646) """
  2. A tICA object created with the old version is apparently not compatible with the new one. I receive this error: """ KeyError: "metric_string not in ['cov_mat', 'timelag_corr_mat', 'vals', 'vecs']" """

We would greatly appreciate any help in overcoming these issues.

Regards, Asghar

schwancr commented 10 years ago

Hi Asghar,

(1) This is likely due to the size of the covariance matrix relative to the amount of data you have. Try passing pca_cutoff to the tICA object's solve method; it throws out degrees of freedom that are undersampled and have zero variance in the observed data. Those coordinates will screw up the eigenvalue calculation.

(2) The new version is a bit more convenient, because it pickles the metric you use to prepare trajectories. There isn't a script that will update the file for you, but you can do it yourself as follows:

import pickle

import numpy as np
from mdtraj import io

# construct whatever metric you used to solve the tICA problem originally
metric = ...

metric_string = pickle.dumps(metric)
io.saveh("tica_filename.h5", metric_string=np.array([metric_string]))

Then you should be able to load the file into the new version of the code.
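As a quick sanity check before touching the real file, the pickle-in-an-array step can be tried on its own. The following is a minimal sketch that leaves out mdtraj entirely; the dictionary is a hypothetical stand-in for the actual metric object, and the assumption that the stored entry is later read back with pickle.loads is not confirmed anywhere in this thread.

```python
import pickle

import numpy as np

# Hypothetical stand-in for the real metric object (assumption for illustration).
metric = {"name": "example_metric", "atom_indices": [0, 4, 7]}

# Protocol 0 produces ASCII-only bytes, which sit safely in a
# fixed-width NumPy byte-string array.
metric_string = pickle.dumps(metric, protocol=0)
stored = np.array([metric_string])

# Recover the object from the array entry, as a loader might do.
restored = pickle.loads(stored[0])
print(restored == metric)
```

If the comparison fails for your own metric class, the problem lies in pickling that class rather than in the HDF5 file.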

asgharrazavi commented 10 years ago

Setting pca_cutoff=0.5 works without error. Thanks a lot.

schwancr commented 10 years ago

You'll want to play around with this parameter (and probably use a smaller number, like 1E-8) to see how the results change.

What's happening is that, by using this cutoff, you are actually doing tICA in PCA space: you first project onto all PCs that have variance greater than the cutoff (0.5 in your case), and then you do tICA in that space. This makes the eigenvalue step more robust, but depending on your units, 0.5 could actually be a very large variance.
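The idea of doing tICA in PCA space can be sketched with plain NumPy. This is not MSMBuilder's implementation: the function name, its arguments, the symmetrized estimate of the time-lagged matrix, and the strict "greater than" comparison against the cutoff are all assumptions made for illustration.

```python
import numpy as np


def tica_in_pca_space(X, lag, pca_cutoff):
    """Illustrative tICA restricted to PCs with variance above pca_cutoff.

    X is an (n_frames, n_features) array from a single trajectory.
    Returns (eigenvalues, eigenvectors in the original feature space).
    """
    X = X - X.mean(axis=0)
    n = X.shape[0] - lag
    a, b = X[:-lag], X[lag:]

    # Instantaneous covariance and symmetrized time-lagged correlation.
    cov = X.T @ X / X.shape[0]
    lagged = (a.T @ b + b.T @ a) / (2.0 * n)

    # PCA step: keep only directions whose variance exceeds the cutoff.
    pc_vals, pc_vecs = np.linalg.eigh(cov)
    keep = pc_vals > pca_cutoff
    # Scale each kept PC to unit variance (whitening).
    W = pc_vecs[:, keep] / np.sqrt(pc_vals[keep])

    # In the whitened PCA space the generalized eigenproblem reduces
    # to an ordinary symmetric one.
    vals, vecs = np.linalg.eigh(W.T @ lagged @ W)
    order = np.argsort(vals)[::-1]
    return vals[order], (W @ vecs)[:, order]


# Toy data: one slow coordinate, one noise coordinate, one dead coordinate.
rng = np.random.default_rng(0)
n_frames = 5000
slow = np.zeros(n_frames)
for t in range(1, n_frames):
    slow[t] = 0.95 * slow[t - 1] + rng.normal()
X = np.column_stack([slow, rng.normal(size=n_frames), np.zeros(n_frames)])

vals, vecs = tica_in_pca_space(X, lag=1, pca_cutoff=1e-8)
print(vals.shape, vecs.shape)
```

The dead coordinate makes the full covariance matrix singular, which is the kind of input that can trip up a generalized eigensolver. Here it is removed by the cutoff before the tICA step, so only two components come back.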

The thing to look at is the eigenvalues of the covariance matrix. The smallest of these will be clustered around zero (maybe some are slightly negative due to numerical issues). Try to pick a cutoff that only removes the very smallest ones (~ 1%) if possible.
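One way to look at the spectrum and turn the "remove about 1%" advice into a number is sketched below. The helper name and its drop_fraction parameter are hypothetical, and it assumes components are kept only when their variance is strictly greater than the cutoff, which may not match how the library compares them.

```python
import numpy as np


def suggest_pca_cutoff(cov_mat, drop_fraction=0.01):
    """Return (ascending eigenvalues, cutoff) so that roughly drop_fraction
    of the smallest eigenvalues fall at or below the cutoff."""
    vals = np.linalg.eigvalsh(cov_mat)  # ascending order for symmetric input
    n_drop = max(1, int(round(drop_fraction * vals.size)))
    # Clamp at zero so slightly negative numerical noise never yields
    # a negative cutoff.
    cutoff = max(float(vals[n_drop - 1]), 0.0)
    return vals, cutoff


# Toy data: 100 features, one of which never changes.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 100))
X[:, 0] = 3.0
cov_mat = np.cov(X, rowvar=False)

vals, cutoff = suggest_pca_cutoff(cov_mat)
n_kept = int((vals > cutoff).sum())
print("smallest eigenvalues:", vals[:5])
print("suggested cutoff:", cutoff, "components kept:", n_kept)
```

Printing the smallest eigenvalues next to the suggested cutoff makes it easy to see whether there is a clear gap between the near-zero values and the rest, which is a safer guide than the fraction alone.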

schwancr commented 10 years ago

By the way, this PCA/tICA approach is unpublished (though it's not too complicated). The paper is in preparation, so I can send you a citation when we have it.

asgharrazavi commented 10 years ago

That would be great. Looking forward to it.