msmbuilder / msmbuilder-legacy

Legacy release of MSMBuilder
http://msmbuilder.org
GNU General Public License v2.0
25 stars 28 forks source link

tICA #387

Closed rmcgibbo closed 8 years ago

rmcgibbo commented 10 years ago

(1) Should tICA.predict subtract out the mean before dotting the timeseries into the tICs? This is, for example, how PCA works in sklearn

    def transform(self, X):
        """Apply the dimensionality reduction on X.
        [...]
        """
        X = array2d(X)
        if self.mean_ is not None:
            X = X - self.mean_
        X_transformed = np.dot(X, self.components_.T)
        return X_transformed

(2) I think there's a bug for the PCA case on line 208. Here's the test case

import numpy as np
from msmbuilder.reduce import tICA
from sklearn.decomposition import PCA
X1 = np.random.randn(10, 3)
X2 = np.random.randn(10, 3)
X3 = np.random.randn(10, 3)

tica = tICA(lag=0)
tica.train(prep_trajectory=np.copy(X1))
tica.train(prep_trajectory=np.copy(X2))
print tica.project(prep_trajectory=np.copy(X3), which=[0])

print PCA(n_components=1).fit(np.vstack((X1, X2))).transform(np.copy(X3))
kyleabeauchamp commented 10 years ago

I suppose this could be beneficial. We main just need to carefully define what we mean by "project".

In some of my code, I was actually doing this, which would not be necessary if we had mean-subtracted projection.

x = tica.project(trj, which=arange(5))
x = sklearn.preprocessing.StandardScaler().fit_transform(x)
rmcgibbo commented 10 years ago

I accidently hit enter before I finished my comment. See above for point 2.

kyleabeauchamp commented 10 years ago

IMHO we shouldn't be writing our own PCA solver.

rmcgibbo commented 10 years ago

Agreed.

kyleabeauchamp commented 10 years ago

I suppose we should get Christian's buyin before we axe his PCA code, but my vote is to just force users to input a positive lagtime and get rid of all the extra branches.

rmcgibbo commented 10 years ago

Actually, my test case is subject to the difference in project. But the point is to say that on line 208, we're incrementing sum_t_dt by sum_t, when I'm pretty sure that's not the right behavior. They should be equal with lag==0

tica = tICA(lag=0)
for i in range(10):
    print 'iteration', i
    tica.train(prep_trajectory=X1)
    np.testing.assert_array_equal(tica.sum_t, tica.sum_t_dt)
$ python test.py
08:01:52 - no metric specified, you must pass prepared trajectories to the train and project methods
iteration 0
iteration 1
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    np.testing.assert_array_equal(tica.sum_t, tica.sum_t_dt)
  File "/Users/rmcgibbo/miniconda/lib/python2.7/site-packages/numpy/testing/utils.py", line 718, in assert_array_equal
    verbose=verbose, header='Arrays are not equal')
  File "/Users/rmcgibbo/miniconda/lib/python2.7/site-packages/numpy/testing/utils.py", line 644, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not equal

(mismatch 100.0%)
 x: array([  3.97145422, -11.23479935,   0.31692569])
 y: array([  5.95718132, -16.85219903,   0.47538853])
schwancr commented 10 years ago

I don't really care about the PCA code. I think I added it because the tICA code was lonely in the "reduce" folder.

As for mean subtracting, let's just do what sklearn does in PCA for consistency. Since we were only using tICA to calculate distances, it never mattered if we subtracted the mean.

On Sun, Mar 23, 2014 at 8:02 AM, Robert McGibbon notifications@github.comwrote:

Actually, my test case is subject to the difference in project. But the point is to say that on line 208, we're incrementing sum_t_dt by sum_t, when I'm pretty sure that's not the right behavior. They should be equal with lag=0

tica = tICA(lag=0) for i in range(10): print 'iteration', i tica.train(prep_trajectory=X1) np.testing.assert_array_equal(tica.sum_t, tica.sum_t_dt)

$ python test.py 08:01:52 - no metric specified, you must pass prepared trajectories to the train and project methods iteration 0 iteration 1 Traceback (most recent call last): File "test.py", line 13, in np.testing.assert_array_equal(tica.sum_t, tica.sum_t_dt) File "/Users/rmcgibbo/miniconda/lib/python2.7/site-packages/numpy/testing/utils.py", line 718, in assert_array_equal verbose=verbose, header='Arrays are not equal') File "/Users/rmcgibbo/miniconda/lib/python2.7/site-packages/numpy/testing/utils.py", line 644, in assert_array_compare raise AssertionError(msg) AssertionError: Arrays are not equal

(mismatch 100.0%) x: array([ 3.97145422, -11.23479935, 0.31692569]) y: array([ 5.95718132, -16.85219903, 0.47538853])

Reply to this email directly or view it on GitHubhttps://github.com/SimTk/msmbuilder/issues/387#issuecomment-38384867 .

rmcgibbo commented 10 years ago

FWIW, I noticed both of those issues while reimplementing the method, since I wanted sklearn compatible method names (fit, fit_transform, etc). That code is here: https://github.com/rmcgibbo/projector/blob/master/projector/models/tica.py

kyleabeauchamp commented 10 years ago

Don't we want sklearn compatible method names here as well?

kyleabeauchamp commented 10 years ago

(I think the answer is yes.)

kyleabeauchamp commented 10 years ago

But obviously there's no rush, I understand it might be easier to push things externally and later backport to MSMB.

schwancr commented 10 years ago

Actually, my test case is subject to the difference in project. But the point is to say that on line 208, we're incrementing sum_t_dt by sum_t, when I'm pretty sure that's not the right behavior. They should be equal with lag==0

I fixed this one at least