markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0

TICA fails in PyEMMA v2.3.1 #1037

Closed TensorDuck closed 7 years ago

TensorDuck commented 7 years ago

I am getting a weird failure using TICA in v2.3.1 after upgrading; TICA was working on this data set previously.

The error is long because it is a recursion error, but the last part of the traceback is this:

File "/home/jc49/miniconda2/lib/python2.7/site-packages/pyemma/coordinates/transform/tica.py", line 55, in _lazy_estimation
    tica_obj._diagonalize()
  File "/home/jc49/miniconda2/lib/python2.7/site-packages/pyemma/coordinates/transform/tica.py", line 297, in _diagonalize
    timescales = self.timescales
  File "<decorator-gen-27>", line 2, in timescales
  File "/home/jc49/miniconda2/lib/python2.7/site-packages/pyemma/coordinates/transform/tica.py", line 55, in _lazy_estimation
    tica_obj._diagonalize()
  File "/home/jc49/miniconda2/lib/python2.7/site-packages/pyemma/coordinates/transform/tica.py", line 291, in _diagonalize
    eigenvalues, eigenvectors = eig_corr(self._covar.cov, self._covar.cov_tau, self.epsilon, sign_maxelement=True)
  File "/home/jc49/miniconda2/lib/python2.7/site-packages/pyemma/_ext/variational/solvers/direct.py", line 152, in eig_corr
    L = spd_inv_split(C0, epsilon=epsilon, method=method, canonical_signs=True)
  File "/home/jc49/miniconda2/lib/python2.7/site-packages/pyemma/_ext/variational/solvers/direct.py", line 63, in spd_inv_split
    assert _np.allclose(W.T, W), 'C0 is not a symmetric matrix'
  File "/home/jc49/miniconda2/lib/python2.7/site-packages/numpy/core/numeric.py", line 2372, in allclose
    res = all(isclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan))
  File "/home/jc49/miniconda2/lib/python2.7/site-packages/numpy/core/numeric.py", line 2453, in isclose
    return within_tol(x, y, atol, rtol)
  File "/home/jc49/miniconda2/lib/python2.7/site-packages/numpy/core/numeric.py", line 2435, in within_tol
    with errstate(invalid='ignore'):
  File "/home/jc49/miniconda2/lib/python2.7/site-packages/numpy/core/numeric.py", line 2968, in __enter__
    self.oldstate = seterr(**self.kwargs)
RuntimeError: maximum recursion depth exceeded

I have attached a tarball with a Python script run_msm.py that reproduces the error for me; all necessary data files are included. I have also attached my conda_list.txt.

test-ticafail.tar.gz conda_list.txt

franknoe commented 7 years ago

This appears to be the symmetrization problem again that was related to an mdtraj issue. No idea why it would pop up again in 2.3.1. @marscher, can you have a look? @TensorDuck, is this behavior specific to this example, or do you see the problem in other examples as well?

--


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de Mail: Arnimallee 6, 14195 Berlin, Germany

TensorDuck commented 7 years ago

I see this in a few similar trajectories, all of them coarse-grained models though. I will test with some all-atom data as well and see if there is a difference.

marscher commented 7 years ago

@TensorDuck: thanks for taking the time to report this with a reproducing example.

@franknoe: maximum recursion depth reached... This has nothing to do with mdtraj. Please tell me which specific issue you are referring to.

It's an endless recursion caused by the decorator voodoo of lazy estimation. I introduced this code for partial_fit support to avoid early diagonalization and perform it only when properties depending on the result are accessed. But in the recent version the decorator is used incorrectly. To clarify:

TICA gets estimated, which calls _diagonalize(). So far, so good. But within the new implementation of _diagonalize(), the lazy property self.timescales is accessed. So it will call _diagonalize() again, because self._estimated is still not true (it becomes true only when the diagonalization has finished).
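The failure mode can be reproduced with a minimal, self-contained sketch of this lazy-estimation pattern (the decorator and class names here are illustrative, not PyEMMA's actual implementation):

```python
import functools

def lazy(prop):
    # Illustrative stand-in for the lazy-estimation decorator: trigger
    # diagonalization before granting access to a derived property.
    @functools.wraps(prop)
    def wrapper(self):
        if not self._estimated:      # flag is set only AFTER diagonalization
            self._diagonalize()      # ...which itself reads a lazy property
        return prop(self)
    return wrapper

class LazyTica(object):
    def __init__(self):
        self._estimated = False

    def _diagonalize(self):
        # Bug: reading self.timescales re-enters the decorator while
        # self._estimated is still False -> unbounded recursion.
        _ = self.timescales
        self._estimated = True

    @property
    @lazy
    def timescales(self):
        return [1.0, 0.5]

try:
    LazyTica().timescales
except RecursionError as err:
    print("RecursionError:", err)
```

On Python 2, the same loop surfaces as the RuntimeError: maximum recursion depth exceeded seen in the traceback above.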

It is a total fail of our test suite. A total disaster 👎 The curious thing is really why this even works in most cases. The only condition under which it can work is that tica_obj._estimated is already true before _diagonalize() is invoked, so the decorator does not call it again. Apparently this condition is met most of the time.

a snippet like this succeeds:

    inp = pyemma.coordinates.source(np.random.random((100, 3)))
    pyemma.coordinates.tica(inp, 1).get_output()

but this one reproduces the error:

    pyemma.coordinates.tica(inp, 1, commute_map=True, kinetic_map=False).get_output()

So this new feature is simply not well tested. The commute_map=True code branch accesses self.timescales -> recursive call to _diagonalize()... just use what has already been computed in eig_corr, like:

    if self.commute_map:
        timescales = -self.lag / np.log(np.abs(eigenvalues))
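As a sketch of that fix (a hypothetical standalone helper, not PyEMMA's actual method), computing the timescales directly from the eigenvalues already in hand avoids re-entering the lazy property:

```python
import numpy as np

def timescales_from_eigenvalues(eigenvalues, lag, commute_map=True):
    # Hypothetical helper: derive implied timescales t = -lag / ln|lambda|
    # from freshly computed eigenvalues, instead of touching a lazy property
    # that would re-trigger estimation.
    if commute_map:
        return -lag / np.log(np.abs(eigenvalues))
    return None

ts = timescales_from_eigenvalues(np.array([0.9, 0.5]), lag=10.0)
print(ts)  # approximately [94.9, 14.4]
```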

BTW: this is the definition of the property self.eigenvalues, but it leads to NaN in the output:

    import numpy as np
    import pyemma
    x = np.random.random((100, 3))
    r = pyemma.coordinates.source(x)
########
In [10]: pyemma.coordinates.tica(r,1,commute_map=True, kinetic_map=False).eigenvalues
/home/marscher/workspace/pyemma/pyemma/coordinates/transform/tica.py:301: RuntimeWarning: invalid value encountered in sqrt
  eigenvectors *= np.sqrt(regularized_timescales / 2)
Out[10]: array([ 0.2235185 , -0.20373609, -0.12104746])
In [11]: pyemma.coordinates.tica(r,1,commute_map=True, kinetic_map=False).timescales
/home/marscher/workspace/pyemma/pyemma/coordinates/transform/tica.py:301: RuntimeWarning: invalid value encountered in sqrt
  eigenvectors *= np.sqrt(regularized_timescales / 2)
Out[11]: array([ 0.66744041,  0.62856324,  0.47358068])

So we have real eigenpairs. But how can the output then be invalid? I think a lot more things have been broken recently and are most likely also untested.
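The NaN can be illustrated in isolation: taking the square root of a regularized timescale that has gone negative yields NaN, which is what the RuntimeWarning from tica.py line 301 hints at. This is a standalone sketch with made-up numbers, not PyEMMA code:

```python
import numpy as np

# Made-up eigenvector matrix and regularized timescales; the negative entry
# mimics what the tanh regularization yields for timescales much smaller
# than the lag time (see the numeric example further down in this thread).
eigenvectors = np.ones((3, 3))
regularized_timescales = np.array([500.0, 0.38, -0.047])

with np.errstate(invalid='ignore'):   # suppress the RuntimeWarning
    scaled = eigenvectors * np.sqrt(regularized_timescales / 2)

print(scaled[0])  # last entry is NaN: sqrt of a negative number
```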

Sorry @TensorDuck that you ran into this problem! You must have wasted hours of CPU time to reach 1000 recursions (diagonalizing your matrix takes 13 seconds on my old box ~ I would have wasted 4 hours). Please don't use the commute_map=True flag for now - it is broken.

References: https://codecov.io/gh/markovmodel/PyEMMA/src/d23dcaab9460410f200714694641702ff3932a78/pyemma/coordinates/transform/tica.py#L296

PS: untested code is broken code. Probably we should go back to having 3 release candidates (like for branch 2.1) before we hit the release button.

@franknoe sorry, we should have done a better job.

TensorDuck commented 7 years ago

Ah, I see. So I wasn't going crazy. I think I will probably roll back a few versions for now then.

marscher commented 7 years ago

I just wanted to point out how serious the issue is. If people use this flag in analysis on a cluster, they will just hit the wall time.

This has been introduced in the stress of paper deadlines/phd thesis wrap up. Please accept our apologies.

I recommend 2.2.7 at the moment.

PS: sorry if you felt offended by that comment [I have a toothache and it's 4 o'clock in the morning].

TensorDuck commented 7 years ago

I will roll back to 2.2.7 then. Thank you to everyone for y'all's help.

franknoe commented 7 years ago

Not sure I understand. I'm also not sure a decorator is a good way to do this. Decorator code is hard to read for most developers, and program logic should be contained in easily accessible code, ideally in the class in which it acts.

In any case I agree on the general points, i.e. release candidates or devel versions (whichever is easier to do) that are only available on demand and have been used for a while in the group before the public release is made, and code testing. Also, I hope to have more time to help with the software soon; I think many recent errors are due to an overly complicated software design and class dependencies in the base framework.

Practically, since TICA is a very basic feature, we need a functional version rather quickly. @fabian-paul, can you help with this?

fabian-paul commented 7 years ago

I don't think you need my help. @marscher correctly identified the problem. commute_map simply wasn't tested.

Concerning @marscher's comment on the RuntimeWarning, there seems to be a general problem with @franknoe's regularization formula. Consider this example:

    import numpy as np
    lag = 1.0
    timescales = np.array([1000, 1, 0.1])
    print 0.5 * timescales * np.tanh(np.pi * ((timescales - lag) / lag) + 1)
    # [  5.00000000e+02   3.80797078e-01  -4.74787185e-02]

This produces negative numbers, which I think wasn't intended. Let me ponder on this...

franknoe commented 7 years ago

Is it only a commute map problem, or is it a general problem with partial fit?

marscher commented 7 years ago

Fabian's example is pure numpy ;-) How do you relate it to partial_fit?

franknoe commented 7 years ago

Your explanation was that there is a problem that depends on whether the estimator has been estimated or not. This doesn't sound like it is specific to the commute map.

fabian-paul commented 7 years ago

@franknoe: it's only a commute map problem. I've added a combined test #1039 that tests commute_map with partial fit. @marscher: the numpy code is a copy from https://github.com/markovmodel/PyEMMA/blob/devel/pyemma/coordinates/transform/tica.py#L300. The point I tried to make is that the code is ok, but the math is wrong.

fabian-paul commented 7 years ago

I think @marscher's explanation is partially (sic!) wrong: "The only condition in which it can work is, if tica_obj._estimated is true before diag() is invoked the decorator does not call it again. So this condition is met most of the time." I don't agree: this condition is not met most of the time. When _diagonalize() is entered, self._estimated usually is False. (I checked this by adding assert not self._estimated as the first line of _diagonalize.) It's a commute map problem.

fabian-paul commented 7 years ago

For the regularization, a quick'n'dirty way of solving the problem of negative implied timescales would be to truncate the regularized timescales at zero:

    0.5 * timescales * np.maximum(np.tanh(np.pi * ((timescales - lag) / lag) + 1), 0)

Talking about hyperbolic functions... why not regularize with the hyperbola directly, without any mental gymnastics?

    lag * np.sqrt(np.maximum((timescales / lag) ** 2 - 1, 0))
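To compare the variants side by side, here is a small standalone check (the numbers follow the earlier example; purely illustrative):

```python
import numpy as np

lag = 1.0
timescales = np.array([1000.0, 1.0, 0.1])

# Original tanh regularization: turns negative for timescales << lag.
tanh_reg = 0.5 * timescales * np.tanh(np.pi * ((timescales - lag) / lag) + 1)

# Quick'n'dirty fix: clamp the tanh factor at zero.
tanh_clamped = 0.5 * timescales * np.maximum(
    np.tanh(np.pi * ((timescales - lag) / lag) + 1), 0)

# Direct hyperbola: ~timescales far above the lag, exactly zero below it.
hyperbola = lag * np.sqrt(np.maximum((timescales / lag) ** 2 - 1, 0))

print(tanh_reg)      # last entry is negative
print(tanh_clamped)  # non-negative everywhere
print(hyperbola)     # non-negative, approaches timescales for large values
```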