Closed franknoe closed 9 years ago
Before making code changes here, let's think about the mathematics a bit. Perhaps it is correct to compute the mean and the instantaneous correlation matrix on T-tau data points only, in order to have a consistent normalization. In any case we are not computing the real mean (which would require reweighting out-of-equilibrium data), so perhaps all that is needed is that the empirical mean is computed consistently with the time-lagged quantities, such that all normalizations are OK and we can expect to get eigenvalues below 1.
Hi, regardless of how one implements it (by copying data from Y or the other way), I do not have a strong opinion about this, except for the following: I believe one should use the estimates (means and covariances) of the dataset that will ultimately be transformed (= all data). This way, one arrives at TICs that will actually be mean-free and have var = 1 (up to precision, etc.).
Still, let's think about it,
(nice WE everybody!)
This is not at all clear to me. If you want to solve a generalized eigenvalue problem with covariance matrices C(0) and C(tau) obtained from empirical data estimates - what is the correct way of estimating the mean such that these matrices will be normalized correctly (e.g. such that we can always expect eigenvalues <= 1)?
It appears reasonable to use all data for the time-instantaneous correlation matrix. But perhaps for the time-lagged covariance matrix the answer is that the mean should only be computed on the T-tau frames that actually enter the time-lagged pairs.
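To make the normalization question concrete, here is a small numerical sketch (plain NumPy/SciPy, not the PyEMMA code; the function name is made up) that compares the two mean conventions on a toy trajectory and inspects the resulting generalized eigenvalues:

```python
import numpy as np
import scipy.linalg

def tica_eigenvalues(X, tau, mean_on_all=True):
    """Solve the generalized eigenvalue problem C(tau) v = l C(0) v
    for a single trajectory X (shape T x d), with the mean taken either
    over all T frames or only over the first T-tau frames."""
    T = X.shape[0]
    mean = X.mean(axis=0) if mean_on_all else X[:T - tau].mean(axis=0)
    X0 = X[:T - tau] - mean          # instantaneous part of each pair
    Xt = X[tau:] - mean              # time-lagged part of each pair
    C0 = np.dot(X0.T, X0) / (T - tau)   # instantaneous covariance
    Ct = np.dot(X0.T, Xt) / (T - tau)   # time-lagged covariance
    Ct = 0.5 * (Ct + Ct.T)              # symmetrize (reversibility)
    return scipy.linalg.eigvalsh(Ct, C0)  # eigenvalues, ascending

rng = np.random.RandomState(42)
X = np.cumsum(rng.randn(500, 3), axis=0)   # toy random-walk trajectory
for flag in (True, False):
    ev = tica_eigenvalues(X, tau=10, mean_on_all=flag)
    print("mean on all data:", flag, "-> largest eigenvalue:", ev.max())
```

With finite data neither convention by itself guarantees eigenvalues <= 1; the sketch only makes the normalization choices explicit so they can be compared.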
(Reply sent by email by Frank Noe, Computational Molecular Biology group, Freie Universitaet Berlin, quoting gph82's comment of 21/03/15, 14:51, above.)
One requirement we could impose is that C_ii(0) >= C_ii(tau). I don't know yet how this relates to the eigenvalues of TICA, but I suspect it could imply that the eigenvalues are <= 1. Can you show this? For infinite data, C_ii(0) >= C_ii(tau) follows from the rearrangement inequality. It would also hold if we normalized C_ii(0) and C_ii(tau) by the same number. This is not how the TICA code does it at the moment, so I suspect that eigenvalues <= 1 is not guaranteed by our implementation (with little data).
Of course there is room for discussion about whether we need to require that the eigenvalues be <= 1.
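For what it's worth, one estimator that does guarantee eigenvalues <= 1 even with little data is the symmetrized one, where C(0) is computed from the same 2(T-tau) samples (both x(t) and x(t+tau)) as C(tau): since 2ab <= a^2 + b^2, we get v^T C(tau) v <= v^T C(0) v for every v, and hence all generalized eigenvalues are <= 1. A hypothetical sketch (not the PyEMMA implementation):

```python
import numpy as np
import scipy.linalg

def symmetrized_tica_eigenvalues(X, tau):
    """Symmetrized estimator: C(0) and C(tau) are computed from the same
    2*(T-tau) samples, with a common mean. By 2ab <= a^2 + b^2, every
    projection satisfies v^T C(tau) v <= v^T C(0) v, so all generalized
    eigenvalues are <= 1 by construction."""
    T = X.shape[0]
    pairs0, pairst = X[:T - tau], X[tau:]
    mean = 0.5 * (pairs0.mean(axis=0) + pairst.mean(axis=0))
    A = pairs0 - mean
    B = pairst - mean
    N = T - tau
    C0 = (A.T @ A + B.T @ B) / (2 * N)   # instantaneous, both ends of pairs
    Ct = (A.T @ B + B.T @ A) / (2 * N)   # symmetrized time-lagged covariance
    return scipy.linalg.eigvalsh(Ct, C0)

rng = np.random.RandomState(0)
X = np.cumsum(rng.randn(200, 4), axis=0)     # toy trajectory
ev = symmetrized_tica_eigenvalues(X, tau=5)
assert np.all(ev <= 1.0 + 1e-8)              # guaranteed by construction
```

Note that this symmetrization implicitly assumes reversibility; whether that is the right trade-off for out-of-equilibrium data is a separate question.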
The data issue has been solved by #140 by Fabian, Guille and me. Do you want to continue the discussion about the math here or open a new issue?
I'll open a new issue
This discussion is continued here:
Having thought about the time-lagged data problem, I suggest the following change with respect to your latest TICA fix:
Transformers that use both X and Y need to check whether Y is None and take the size of Y into account. This is very simple - for example, for TICA:
```python
if Y is not None:
    n = Y.shape[0]
    Ctau = np.dot(X[:n].T, Y)
```
This convention forces the specific transformer to be explicit. Omitting the None check, or not taking the explicit length of Y into account, will lead to an exception, which is better than silently computing a wrong result. In contrast, I find the 'trick' of using part of Y to compute the mean when one is at the end of a trajectory too convoluted - other people will have a hard time understanding this code.