markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0

Simplify TICA Parameters #1075

Closed: franknoe closed this issue 4 years ago

franknoe commented 7 years ago

tica's init method has four parameters that determine the output dimension and scaling, and that have accumulated over time as different methods were added (especially kinetic maps and commute maps): dim, var_cutoff, kinetic_map and commute_map.

Now that's a bit confusing, because these parameters are interdependent, and the current default behavior is also not the most sensible. By default I get a scaling by the TICA eigenvalues (a kinetic map at the selected lag time), and the output dimension is selected such that the kinetic variance adds up to 95%. If I instead want to select a fixed output dimension, I would in principle have to set two values: dim=10 and var_cutoff=1.0. To avoid that, we check whether var_cutoff was set away from its default, i.e. setting dim=10, var_cutoff=0.95 will end up with 10 dimensions even if the explained variance is then 99%. That's confusing. Moreover, the current situation allows illegal or inconsistent settings such as kinetic_map=True, commute_map=True.
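For concreteness, here is roughly what the two situations above look like with the current API (toy data; the parameter names are the ones discussed in this issue and the defaults are quoted from memory, so treat this as a sketch rather than the exact signature):

    import numpy as np
    import pyemma.coordinates as coor

    data = np.random.randn(1000, 20)   # toy trajectory: 1000 frames, 20 features

    # current defaults (roughly): scale by eigenvalues (kinetic map) and keep
    # as many dimensions as needed to reach 95% kinetic variance
    t1 = coor.tica(data, lag=10, kinetic_map=True, commute_map=False,
                   var_cutoff=0.95)

    # requesting exactly 10 dimensions in principle means touching two values
    t2 = coor.tica(data, lag=10, dim=10, var_cutoff=1.0)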

Looking at the sklearn implementation of PCA, which has to address the same situation, here's a suggestion for how to solve it. We keep only the following two parameters to control the output dimension and scaling:

   """
        dim : float or int
            Number of dimensions to keep:
            * if dim is not set, all available ranks are kept::
                n_components == min(n_samples, n_features)
            * if dim is an integer >= 1, this number specifies the number
              of dimensions to keep. By default this will use the kinetic
              variance unless scaling=`commute map` is selected.
            * if dim is a float with ``0 < dim < 1``, select the number
              of dimensions such that the amount of kinetic variance 
              that needs to be explained is greater than the percentage 
              specified by dim.
        scaling : None or string
            Scaling to be applied to the TICA modes upon transformation
            * None: no scaling will be applied, variance along the mode is 1
            * 'kinetic map' or 'km': modes are scaled by eigenvalue
            * 'commute map' or 'cm': modes are scaled by :math:`\sqrt{t_i/2}`,
              where :math:`t_i = -\mathrm{lag}/\ln(|\lambda_i|)` is the relaxation time
              computed from the eigenvalue :math:`\lambda_i`.
   """

That way, dim is the only parameter that determines the output dimension, and it can be set in different ways to do that. scaling is the only parameter that determines the output of transform(X) or get_output(); it could become more complex (e.g. additional parameters in the string, or accepting a dict) if we wanted to encode more, such as the function used to penalize small eigenvalues.
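To make the proposal concrete, here is a minimal sketch of how the two parameters could be resolved internally. The helper names (resolve_output_dim, scale_modes) are hypothetical and the code is plain NumPy, not the actual implementation:

    import numpy as np

    def resolve_output_dim(eigenvalues, dim=None):
        # hypothetical resolution of the proposed ``dim`` parameter
        n = len(eigenvalues)
        if dim is None:
            return n                           # keep all available ranks
        if isinstance(dim, int) and dim >= 1:
            return min(dim, n)                 # fixed number of dimensions
        if 0.0 < dim < 1.0:
            # kinetic variance: cumulative normalized sum of squared eigenvalues
            kinvar = np.cumsum(eigenvalues**2) / np.sum(eigenvalues**2)
            return int(np.searchsorted(kinvar, dim)) + 1
        raise ValueError('dim must be None, an int >= 1, or a float in (0, 1)')

    def scale_modes(ics, eigenvalues, lag, scaling=None):
        # hypothetical scaling of the projected coordinates (ICs);
        # eigenvalues must correspond to the kept modes (columns of ics)
        if scaling is None:                    # whitened output, variance 1
            return ics
        if scaling in ('kinetic map', 'km'):
            return ics * eigenvalues           # scale each mode by its eigenvalue
        if scaling in ('commute map', 'cm'):
            timescales = -lag / np.log(np.abs(eigenvalues))   # t_i = -lag / ln|lambda_i|
            return ics * np.sqrt(timescales / 2.0)
        raise ValueError('unknown scaling: %r' % scaling)

With this, dim=10 keeps ten modes regardless of scaling, dim=0.95 reproduces the current variance-cutoff behavior, and dim=None keeps everything.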

Of course we would need to introduce this behavior "smoothly", i.e. without killing the other parameters immediately: we could enable the new parameter behavior first, deprecate the old parameters, and remove them in a later version.
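One way such a transition could look is a small shim that translates the deprecated parameters into the new ones while warning the user. The function below is purely hypothetical (names, sentinels and warning texts are made up for illustration):

    import warnings

    def _translate_old_args(dim=None, var_cutoff=None, kinetic_map=None, commute_map=None):
        # hypothetical shim: None means "not set by the user" (the real defaults differ)
        if kinetic_map and commute_map:
            raise ValueError('kinetic_map=True and commute_map=True are inconsistent')
        scaling = None
        if kinetic_map:
            scaling = 'kinetic map'
        if commute_map:
            scaling = 'commute map'
        if kinetic_map is not None or commute_map is not None:
            warnings.warn('kinetic_map/commute_map are deprecated, use scaling=...',
                          DeprecationWarning, stacklevel=2)
        if var_cutoff is not None:
            warnings.warn('var_cutoff is deprecated, pass a float 0 < dim < 1 as dim instead',
                          DeprecationWarning, stacklevel=2)
            if dim is None:
                dim = var_cutoff
        return dim, scaling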

We would make the same changes in pca, although the situation there is a bit simpler.

What do you think? Is it worth making this change in PyEMMA 2.*, or does it add confusion?

j3mdamas commented 7 years ago

It sounds good, as long as it is robust and correctly deprecated (which may not be easy, as you suggest). Maybe it should be clear in the docstring that 0<dim<1 only works with scaling!=None.

franknoe commented 7 years ago

I think I would still make it work with scaling=None; in that case the contribution is computed by the kinetic map (the current PyEMMA default), just that the actual output is not scaled. Or would that be confusing?


j3mdamas commented 7 years ago

That is confusing. If scaling=None, I think the kinetic variance cut-off should then work unscaled. No? Otherwise, what is the purpose of scaling=None?

franknoe commented 7 years ago

Scaling defines whether (and how) the output coordinates are scaled. By default they would have variance 1; otherwise they are scaled by the eigenvalues or by sqrt(t_i/2), where t_i is the timescale. But using scaled variables doesn't mean one has to use that scale in order to determine the number of output variables. You could use scaled variables but still fix their number.

The other way around is not so clear. If you request unscaled output but still ask for the number of dimensions to be determined by the explained variance, it's undefined which variance should be used: the kinetic variance, i.e. the sum of squared eigenvalues, or the commute variance, i.e. half the sum of timescales. I would use the kinetic variance there because it's more basic.
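To make the two notions of variance mentioned above concrete, here is a tiny NumPy illustration with made-up eigenvalues; the formulas are the ones described in this thread:

    import numpy as np

    lag = 10
    eigenvalues = np.array([0.95, 0.80, 0.50, 0.20])    # toy TICA eigenvalues
    timescales = -lag / np.log(np.abs(eigenvalues))     # t_i = -lag / ln|lambda_i|

    # kinetic variance: based on the sum of squared eigenvalues
    kinetic_var = np.cumsum(eigenvalues**2) / np.sum(eigenvalues**2)
    # commute variance: based on half the sum of timescales
    commute_var = np.cumsum(timescales / 2.0) / np.sum(timescales / 2.0)

    # cumulative fraction explained per added dimension, under either definition;
    # a cutoff like 0.95 can give a different number of dimensions in each case
    print(kinetic_var)
    print(commute_var)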

An alternative (as you suggest) would be to prohibit the combination scaling=None and dim<1, but then the use case of getting unscaled coordinates while determining their number from the explained variance is not covered. It's probably not a super common use case, so I guess it would be fine not to have it.


j3mdamas commented 7 years ago

Oh, I thought it worked like PCA, where each component's contribution to the total variability (which I thought was equivalent to the kinetic variance) can be calculated for the unscaled case.

franknoe commented 7 years ago

It's conceptually the same. How to scale is a matter of convention and depends on what you want to do with the data, but selecting the number of dimensions is mostly independent of that choice.

In PCA the eigenvalues are the variances of the data along the eigenvector directions. The PCA eigenvectors come from some solver, so at first the scaling is arbitrary. There are at least two scaling conventions that would make sense: one whitens the data, i.e. scales it to a variance of 1 along each component when you project the data onto the respective eigenvector; another maintains the variance, such that you can reconstruct the original (unscaled) data from the dominant components. The first option is often a useful preprocessing step before feeding the data into machine learning algorithms, the second option is good for dimension reduction when you want to leave the data otherwise unchanged. Independently of how you do the scaling, you can select the number of components that you want to keep either manually or by the explained variance, and for that you would use the eigenvalues.
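A small NumPy illustration of the two PCA scaling conventions described above (whitening vs. keeping the variance); this is just a generic PCA sketch, not PyEMMA code:

    import numpy as np

    X = np.random.randn(1000, 5) @ np.random.randn(5, 5)   # toy correlated data
    Xc = X - X.mean(axis=0)                                 # center the data

    C = np.cov(Xc, rowvar=False)                            # covariance matrix
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1]                         # sort by decreasing variance
    evals, evecs = evals[order], evecs[:, order]

    Y = Xc @ evecs                   # variance-preserving: var(Y[:, i]) ~ evals[i]
    Y_white = Y / np.sqrt(evals)     # whitened: variance ~ 1 along every component

    # dimension selection by explained variance works the same in either convention
    explained = np.cumsum(evals) / np.sum(evals)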

In TICA the situation is similar. By default the TICs are white (variance one and uncorrelated), because these constraints are used in the variational principle. To obtain a notion of "variance" one must define a distance metric, and the two ways we have thought of doing this are the kinetic map / kinetic variance (which came out of a distance definition made by the diffusion map people) and the commute map distance, which has the nice property that Euclidean distances are approximately proportional to commute times. Again, we could use either one to scale the coordinates (or not scale them at all, which corresponds to whitening), independent of how many coordinates we choose.


j3mdamas commented 7 years ago

Well, then the scaling=None should correspond to the whitening you're talking about, no?

franknoe commented 7 years ago

exactly.


j3mdamas commented 7 years ago

That makes sense.


stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

marscher commented 4 years ago

This would be fixed by #1366, but it has still not received a review and I refuse to merge it without one.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.