markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0
311 stars 119 forks source link

TICA inverse_transform #653

Closed stefdoerr closed 8 years ago

stefdoerr commented 8 years ago

Say I want for some reason to calculate the reconstruction error of TICA. The TICA class does not have a inverse_transform method implemented. Looking at _transform_array

def _transform_array(self, X):
    X_meanfree = X - self.mu
    Y = np.dot(X_meanfree, self._eigenvectors[:, 0:self.dimension()])
    if self._kinetic_map:  # scale by eigenvalues
        Y *= self._eigenvalues[0:self.dimension()]
    return Y

this would be the correct inverse transform, right?

tica = TICA(20, 10)
datatransf = tica.fit_transform(data) # forward transform
if tica._kinetic_map: # scale back
    datatransf /= tica._eigenvalues[0:tica.dimension()] 
reconst = np.dot(datatransf, np.transpose(tica._eigenvectors[:, 0:tica.dimension()])) # Transpose eigenvector matrix
reconst += tica.mu # add mean back
marscher commented 8 years ago

Looks correct. You could simply use numpy.testing.assert_almost_equal on the result to check if you obtain the original data.

stefdoerr commented 8 years ago

Ah yes, if I use all the dimensions. Valid point ;) Thanks.

stefdoerr commented 8 years ago

Hm, I am getting weird behaviour from the TICA class. I ask for all dimensions TICA(20, 595) and indeed if I do tica.dimension() I get 595. However if I do tica.transform(data) it returns 528 dimensional data. What is the reason for this?

gph82 commented 8 years ago

Hi Stefan, I think I have an idea, can you please post the original TICA-call?

stefdoerr commented 8 years ago
tica = TICA(20, 595)
proj1 = tica.fit_transform(datalist)
proj2 = tica.transform(otherdataarray)
franknoe commented 8 years ago

That's normal and correct. It is necessary to reduce the rank of the matrices to the numerical rank in order to solve the problem. You generally can not expect that the output dimension equals the input dimension (unless your problem is low-dimensional to start with).

Am 16/12/15 um 16:07 schrieb Stefan:

Hm, I am getting weird behaviour from the TICA class. I ask for all dimensions |TICA(20, 595)| and indeed if I do |tica.dimension()| I get 595. However if I do |tica.transform(data)| it returns 528 dimensional data. What is the reason for this?

— Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/653#issuecomment-165135512.


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

gph82 commented 8 years ago

So if I am not mistaken, it has to do with the epsilon argument, that is, to ignore TICA-vectors that have eigenvalues smaller than that epsilon.

We seem to have intended to document this at some moment: the docs for transform say "see epsilon", but then nothing is mentioned. I'm going to post the code where epsilon happens, perhaps we should improve the docs

stefdoerr commented 8 years ago

Example:

In [7]: tica = TICA(20, 595)
In [8]: tica.dimension()
Out[8]: 595

In [9]: np.shape(data.dat.tolist())
Out[9]: (2130, 500, 595)

In [10]: dataticalist = tica.fit_transform(data.dat.tolist())
calculate covariances: 100% (2130/2130) [#############################################################################################################################] eta 00:00 |
In [11]: np.shape(dataticalist)
Out[11]: (2130, 500, 528)
gph82 commented 8 years ago

wow! 3 guys posting simultaneously

gph82 commented 8 years ago

From the generalized eigenvalue problem-solver https://github.com/markovmodel/PyEMMA/blob/21bbe600d044a2e0e59b280dba6e2510ba2a5158/pyemma/util/linalg.py#L117

    epsilon : float
        eigenvalue norm cutoff. Eigenvalues of C0 with norms <= epsilon will be
        cut off. The remaining number of Eigenvalues define the size of
        the output.
stefdoerr commented 8 years ago

Okay, then maybe tica.dimension() should return the actual number of dimensions it will project on. Maybe even add some logging.info if you want saying that the number of dimensions were reduced to X.

franknoe commented 8 years ago

Am 16/12/15 um 16:34 schrieb Stefan:

Okay, then maybe |tica.dimension()| should return the actual number of dimensions it will project on.

Yes! If that is not the case, it is a bug.

Maybe even add some |logging.info| if you want saying that the number of dimensions will be reduced.

— Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/653#issuecomment-165145489.


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

gph82 commented 8 years ago

And perhaps add the role of epsilon to the docs . @stefdoerr , just out curiosity, can you call transform with an epsilon close to zero, say 1e-17?

stefdoerr commented 8 years ago

@franknoe It's a bug then:

In [15]: tica = TICA(20, 595)

In [16]: tica.dimension()
Out[16]: 595

In [17]: dataticalist = tica.fit_transform(data.dat.tolist())
calculate covariances: 100% (2130/2130) [#############################################################################################################################] eta 00:01 |
In [18]: tica.dimension()
Out[18]: 595

In [19]: np.shape(dataticalist)
Out[19]: (2130, 500, 528)

@gph82 doesn't seem to help

In [20]: tica = TICA(20, 595, epsilon=1E-17)

In [21]: tica.dimension()
Out[21]: 595

In [22]: dataticalist = tica.fit_transform(data.dat.tolist())
calculate covariances: 100% (2130/2130) [#############################################################################################################################] eta 00:01 |
In [23]: tica.dimension()
Out[23]: 595

In [24]: np.shape(dataticalist)
Out[24]: (2130, 500, 528)
gph82 commented 8 years ago

@stefdoerr , I mean when calling transform...

stefdoerr commented 8 years ago
In [25]: dataticalist = tica.fit_transform(data.dat.tolist(), epsilon=1E-17)
calculate covariances: 100% (2130/2130) [#############################################################################################################################] eta 00:01 -
In [26]: np.shape(dataticalist)
Out[26]: (2130, 500, 528)

Why would that make a difference if I pass it in the constructor or transform.

gph82 commented 8 years ago

I thought it would...sorry, I got that wrong...we'll look into this, thanks!

franknoe commented 8 years ago

You can also call it with 0, but that won't retain all dimensions either, because numerically the eigenvalues of an SPD matrix can be negative, and the largest norm of a negative eigenvalue will determine the minimum epsilon being used.

Am 16/12/15 um 16:38 schrieb Guillermo Pérez-Hernández:

And perhaps add the role of epsilon to the docs . @stefdoerr https://github.com/stefdoerr , just out curiosity, can you call transform with an epsilon close to zero, say 1e-17?

— Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/653#issuecomment-165147278.


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

stefdoerr commented 8 years ago

I called it with 0 but I got an error for NaN/Infs

franknoe commented 8 years ago

@fabian-paul should have a look at it and the output dimension. I'm under water til friday.

Am 16/12/15 um 16:56 schrieb Stefan:

I called it with 0 but I got an error for NaN/Infs

— Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/653#issuecomment-165153731.


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

fabian-paul commented 8 years ago

Ok, let's see what I can do. First of all tica.dimension() must be fixed. When the correlation matrix doesn't have full rank, there is no inverse. But you can compute the pseudo-inverse. @stefdoerr Would this help for computing the reconstruction error? Ah, and the inverse of the eigenvector matrix doesn't equal its transpose. TICA is not an orthogonal coordinate transform.

fabian-paul commented 8 years ago

The wrong dimension is coming from this part of the code https://github.com/markovmodel/PyEMMA/blob/devel/pyemma/coordinates/transform/tica.py#L168

d = None
if self._dim != -1:  # fixed parametrization
    d = self._dim
elif self._parametrized:  # parametrization finished. Dimension is known
    dim = len(self._eigenvalues)
    if self._var_cutoff < 1.0:  # if subspace_variance, reduce the output dimension if needed
        dim = min(dim, np.searchsorted(self._cumvar, self._var_cutoff)+1)
    d = dim
elif self._var_cutoff == 1.0:  # We only know that all dimensions are wanted, so return input dim
    d = self.data_producer.dimension()
else:  # We know nothing. Give up
    raise RuntimeError('Requested dimension, but the dimension depends on the cumulative variance and the '
                               'transformer has not yet been parametrized. Call parametrize() before.')

Does anybody know why we have the first branch in that if clause at all? What is the point of having a fixed parametrization when it can contradict the data? Why is this needed?

I think that we should at least change the first line to

if self._dim != -1 and not self._parametrized: ...
marscher commented 8 years ago

You're right. It should be checked if the parametrization has been done prior blindly returning the fixed dimension.

stefdoerr commented 8 years ago

@fabian-paul Yes thanks! Didn't know the TICA transform was not orthogonal. The pseudo-inverse worked great. Was wondering why the reconstruction costs were so huge before, hehe.