Closed franknoe closed 9 years ago
Before making code changes here, let's think about the mathematics a bit. Perhaps it is correct to compute the mean and the instantaneous correlation matrix on T-tau data points only, in order to have a consistent normalization. In any case we are not computing the real mean (which would require reweighting out-of-equilibrium data), so perhaps all that is needed is that the empirical mean is computed consistently with the time-lagged quantities, such that all normalizations are OK and we can expect to get eigenvalues below 1.
Hi, regardless of how one implements it (by copying data from Y or the other way), I do not have a strong opinion about this, except for the following: I believe one should use the estimates (means and covariances) of the dataset that will ultimately be transformed (= all data). This way, one arrives at TICs that will actually be mean-free and have var = 1 (up to precision, etc.).
Still, let's think about it,
(nice WE everybody!)
This is not at all clear to me. If you want to solve a generalized eigenvalue problem with covariance matrices C(0) and C(tau) obtained from empirical data estimates - what is the correct way of estimating the mean such that these matrices will be normalized correctly (e.g. such that we can always expect eigenvalues <= 1)?
It appears reasonable to use all data for the time-instantaneous correlation matrix. But perhaps for the time-lagged covariance matrix the answer is that the mean should only be computed on the T-tau frames that actually enter the time-lagged pairs.
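To make the normalization question concrete, here is a small numerical sketch (plain NumPy/SciPy, not the PyEMMA code; the function name is made up) that compares the two mean conventions on a toy trajectory and inspects the resulting generalized eigenvalues:

```python
import numpy as np
import scipy.linalg

def tica_eigenvalues(X, tau, mean_on_all=True):
    """Solve the generalized eigenvalue problem C(tau) v = l C(0) v
    for a single trajectory X (shape T x d), with the mean taken either
    over all T frames or only over the first T-tau frames."""
    T = X.shape[0]
    mean = X.mean(axis=0) if mean_on_all else X[:T - tau].mean(axis=0)
    X0 = X[:T - tau] - mean          # instantaneous part of each pair
    Xt = X[tau:] - mean              # time-lagged part of each pair
    C0 = np.dot(X0.T, X0) / (T - tau)   # instantaneous covariance
    Ct = np.dot(X0.T, Xt) / (T - tau)   # time-lagged covariance
    Ct = 0.5 * (Ct + Ct.T)              # symmetrize (reversibility)
    return scipy.linalg.eigvalsh(Ct, C0)  # eigenvalues, ascending

rng = np.random.RandomState(42)
X = np.cumsum(rng.randn(500, 3), axis=0)   # toy random-walk trajectory
for flag in (True, False):
    ev = tica_eigenvalues(X, tau=10, mean_on_all=flag)
    print("mean on all data:", flag, "-> largest eigenvalue:", ev.max())
```

With finite data neither convention by itself guarantees eigenvalues <= 1; the sketch only makes the normalization choices explicit so they can be compared.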
(Reply sent by email by Frank Noe, Computational Molecular Biology group, Freie Universitaet Berlin, quoting gph82's comment of 21/03/15, 14:51, above.)
One requirement we could impose is that C_ii(0) >= C_ii(tau). I don't know yet how this relates to the eigenvalues of TICA, but I suspect it could imply that the eigenvalues are <= 1. Can you show this? For infinite data, C_ii(0) >= C_ii(tau) follows from the rearrangement inequality. It would also hold if we normalized C_ii(0) and C_ii(tau) by the same number. This is not how the TICA code does it at the moment, so I suspect that eigenvalues <= 1 is not guaranteed by our implementation (with little data).
Of course there is room for discussion about whether we need to require that the eigenvalues be <= 1.
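For what it's worth, one estimator that does guarantee eigenvalues <= 1 even with little data is the symmetrized one, where C(0) is computed from the same 2(T-tau) samples (both x(t) and x(t+tau)) as C(tau): since 2ab <= a^2 + b^2, we get v^T C(tau) v <= v^T C(0) v for every v, and hence all generalized eigenvalues are <= 1. A hypothetical sketch (not the PyEMMA implementation):

```python
import numpy as np
import scipy.linalg

def symmetrized_tica_eigenvalues(X, tau):
    """Symmetrized estimator: C(0) and C(tau) are computed from the same
    2*(T-tau) samples, with a common mean. By 2ab <= a^2 + b^2, every
    projection satisfies v^T C(tau) v <= v^T C(0) v, so all generalized
    eigenvalues are <= 1 by construction."""
    T = X.shape[0]
    pairs0, pairst = X[:T - tau], X[tau:]
    mean = 0.5 * (pairs0.mean(axis=0) + pairst.mean(axis=0))
    A = pairs0 - mean
    B = pairst - mean
    N = T - tau
    C0 = (A.T @ A + B.T @ B) / (2 * N)   # instantaneous, both ends of pairs
    Ct = (A.T @ B + B.T @ A) / (2 * N)   # symmetrized time-lagged covariance
    return scipy.linalg.eigvalsh(Ct, C0)

rng = np.random.RandomState(0)
X = np.cumsum(rng.randn(200, 4), axis=0)     # toy trajectory
ev = symmetrized_tica_eigenvalues(X, tau=5)
assert np.all(ev <= 1.0 + 1e-8)              # guaranteed by construction
```

Note that this symmetrization implicitly assumes reversibility; whether that is the right trade-off for out-of-equilibrium data is a separate question.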
The data issue has been solved by #140 by Fabian, Guille and me. Do you want to continue the discussion about the math here or open a new issue?
I'll open a new issue
This discussion is continued here:
Having thought about the time-lagged data problem, I suggest the following change with respect to your latest TICA fix:
Transformers that use both X and Y need to check whether Y is None and take the size of Y into account. This is very simple - for example, for TICA:
```python
if Y is not None:
    n = Y.shape[0]
    Ctau = np.dot(X[:n].T, Y)
```
This convention forces the specific transformer to be explicit. Omitting the None check, or not taking the explicit length of Y into account, will lead to an exception, which is better than silently computing a wrong result. In contrast, I find the 'trick' of using part of Y to compute the mean when one is at the end of a trajectory too convoluted - other people will have a hard time understanding this code.