msmbuilder / msmbuilder-legacy

Legacy release of MSMBuilder
http://msmbuilder.org
GNU General Public License v2.0

CalculateImpliedTimescales.py crashes using assignments given by AssignHierarchical.py #296

Open s-gordon opened 10 years ago

s-gordon commented 10 years ago

Following on from the issue I raised earlier (#295), I'm having trouble with the CalculateImpliedTimescales.py script when working with assignments generated by AssignHierarchical.py. This does not seem to occur when using assignments generated by RMSD hybrid clustering.

The following is the typical output I'm getting when executing the script:

CalculateImpliedTimescales.py -a Data1/Assignments.h5 -l 1,100 -i 5 -o Data1/ImpliedTimescales.dat
--------------------------------------------------------------------------------

MSMBuilder version 2.7.dev.dev-Unknown

See file AUTHORS for a list of MSMBuilder contributors.

--------------------------------------------------------------------------------

Copyright 2011 Stanford University.

MSMBuilder comes with ABSOLUTELY NO WARRANTY.

MSMBuilder is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

--------------------------------------------------------------------------------

Please cite the following references:

GR Bowman, X Huang, and VS Pande. Methods 2009. Using generalized ensemble 
simulations and Markov state models to identify conformational states.

KA Beauchamp, GR Bowman, TJ Lane, L Maibaum, IS Haque, VS Pande.  JCTC 2011.
MSMBuilder2: Modeling Conformational Dynamics
at the Picosecond to Millisecond Timescale

IS Haque, KA Beauchamp, VS Pande.  In preparation.
A Fast 3 x N Matrix Multiply Routine for Calculation of Protein RMSD.

--------------------------------------------------------------------------------
{'assignments': 'Data1/Assignments.h5',
 'eigvals': 10,
 'interval': 5,
 'lagtime': '1,100',
 'notrim': False,
 'output': 'Data1/ImpliedTimescales.dat',
 'procs': 1,
 'quiet': False,
 'symmetrize': 'MLE'}
21:37:54 - Getting 10 eigenvalues (timescales) for each lagtime...
21:37:54 - Building MSMs at the following lag times: [1, 6, 11, 16, 21, 26, 31, 36, 41, 46, 51, 56, 61, 66, 71, 76, 81, 86, 91, 96]
21:37:54 - Calculating implied timescales at lagtime 1
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:486: SparseEfficiencyWarning: changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
...
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:486: SparseEfficiencyWarning: changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
Traceback (most recent call last):
  File "/usr/local/bin/CalculateImpliedTimescales.py", line 5, in <module>
    pkg_resources.run_script('msmbuilder==2.7.dev', 'CalculateImpliedTimescales.py')
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 499, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1235, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.7.dev-py2.7-linux-x86_64.egg/EGG-INFO/scripts/CalculateImpliedTimescales.py", line 82, in <module>
    (not args.notrim), args.symmetrize, args.procs)
  File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.7.dev-py2.7-linux-x86_64.egg/EGG-INFO/scripts/CalculateImpliedTimescales.py", line 64, in run
    trimming=trimming, symmetrize=symmetrize, n_procs=nProc)
  File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.7.dev-py2.7-linux-x86_64.egg/msmbuilder/msm_analysis.py", line 185, in get_implied_timescales
...
lags = result.get(999999)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 528, in get
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:486: SparseEfficiencyWarning: changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
...
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:486: SparseEfficiencyWarning: changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)

It simply cuts out after the last line without writing out ImpliedTimescales.dat.

kyleabeauchamp commented 10 years ago

So I'm not sure which part of the calculation is crashing, but this does happen sometimes.

The easiest workaround for now is to use fewer lag times or fewer states. I think this bug tends to occur more often at longer lag times or with more states, but I'm not 100% sure.

s-gordon commented 10 years ago

Hmmm. I've just tried halving the number of states from ~1700 to 800, and this is what I'm getting now:

/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:486: SparseEfficiencyWarning: changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
/usr/local/lib/python2.7/dist-packages/msmbuilder-2.7.dev-py2.7-linux-x86_64.egg/msmbuilder/MSMLib.py:592: RuntimeWarning: invalid value encountered in double_scalars
  logger.info("Selected component %d with population %f", ComponentInd, ComponentPops[ComponentInd] / ComponentPops.sum())
10:17:31 - Selected component 0 with population nan
10:17:31 - Calculating implied timescales at lagtime 15
Traceback (most recent call last):
  File "/usr/local/bin/CalculateImpliedTimescales.py", line 5, in <module>
    pkg_resources.run_script('msmbuilder==2.7.dev', 'CalculateImpliedTimescales.py')
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 499, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1235, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.7.dev-py2.7-linux-x86_64.egg/EGG-INFO/scripts/CalculateImpliedTimescales.py", line 82, in <module>
    (not args.notrim), args.symmetrize, args.procs)
  File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.7.dev-py2.7-linux-x86_64.egg/EGG-INFO/scripts/CalculateImpliedTimescales.py", line 64, in run
    trimming=trimming, symmetrize=symmetrize, n_procs=nProc)
  File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.7.dev-py2.7-linux-x86_64.egg/msmbuilder/msm_analysis.py", line 185, in get_implied_timescales
    lags = result.get(999999)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 528, in get
    raise self._value
IndexError: invalid index

The fifth line of that output ("Selected component 0 with population nan") worries me the most. Further decreasing the number of states does not seem to solve the problem, although I haven't seen any crashes yet.

s-gordon commented 10 years ago

Never mind; I think I've found a solution. The script only works when every sampled lag time is a multiple of the stride used for clustering (50 in my case). Any other lag time either crashes the script or produces an "IndexError: invalid index" error.

While I've found a solution, I'm not sure that I understand the philosophy behind it...
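For anyone who wants to apply the workaround described above, here is a minimal sketch of one way to pick lag times that fall on the clustering stride. The helper name `lag_times_on_stride` is hypothetical and is not part of MSMBuilder; it only computes the list of values.

```python
def lag_times_on_stride(min_lag, max_lag, stride):
    """Return the lag times in [min_lag, max_lag] that are multiples of stride.

    Hypothetical helper for illustration; not an MSMBuilder function.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    # Smallest multiple of `stride` that is >= min_lag, and at least one stride.
    first = max(stride, -(-min_lag // stride) * stride)
    return list(range(first, max_lag + 1, stride))


lags = lag_times_on_stride(1, 500, 50)
print(lags)  # [50, 100, 150, 200, 250, 300, 350, 400, 450, 500]
```

With a stride of 50, this corresponds to passing a minimum lag time of 50 and an interval of 50 through the `-l` and `-i` options used earlier in this thread. Whether the script treats the upper bound as inclusive is not shown here, so check the list of lag times it logs.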

kyleabeauchamp commented 10 years ago

OK, I think I know what's going on. It's actually not possible to extend a hierarchical clustering to lag times finer than the stride used during clustering. This is because there is no concept of "generator" or "cluster center" in hierarchical clustering, so frames that were skipped during clustering never receive a state.
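A toy illustration of why this matters for lag times, under one loudly stated assumption: that frames skipped during clustering are marked as unassigned with -1 in the assignments. That detail is not confirmed in this thread. The function `count_valid_pairs` is a hypothetical helper, not MSMBuilder code.

```python
def count_valid_pairs(assignments, lag):
    """Count (t, t + lag) frame pairs in which both frames carry a state label."""
    return sum(
        1
        for a, b in zip(assignments, assignments[lag:])
        if a != -1 and b != -1
    )


stride = 50
n_frames = 1000
# Toy trajectory: only every `stride`-th frame was clustered.
# ASSUMPTION: all other frames are marked -1 (unassigned).
assignments = [(t // stride) % 3 if t % stride == 0 else -1
               for t in range(n_frames)]

for lag in (1, 15, 50, 100):
    print(lag, count_valid_pairs(assignments, lag))
# 1 0
# 15 0
# 50 19
# 100 18
```

In this sketch, lag times of 1 and 15 yield no usable transitions at all, while multiples of the stride do. An empty count matrix of that kind would be consistent with the "population nan" message and the IndexError reported above, although the thread does not confirm that this is the exact failure path.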

For k-centers, k-medoids, and hybrid, there IS the concept of a generator, which allows you to transfer (or apply) your clustering to new data.
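To make the role of generators concrete, here is a minimal sketch of generator-based assignment: each new frame takes the label of its nearest generator. It uses Euclidean distance on made-up 2D points as a stand-in for RMSD between conformations, and `assign_to_generators` is a hypothetical name rather than the MSMBuilder API.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points (a stand-in for RMSD here)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_to_generators(frames, generators):
    """Label each frame with the index of its nearest generator."""
    labels = []
    for frame in frames:
        distances = [euclidean(frame, g) for g in generators]
        labels.append(distances.index(min(distances)))
    return labels


# Made-up data: two generators and three frames that were not used in clustering.
generators = [(0.0, 0.0), (10.0, 10.0)]
new_frames = [(1.0, 0.5), (9.0, 11.0), (4.0, 4.0)]
print(assign_to_generators(new_frames, generators))  # [0, 1, 0]
```

A hierarchical clustering produces only labels for the frames it was given, with no representative conformation per state, so there is nothing equivalent to `generators` to compare new frames against.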

Regardless, you should get a more descriptive error here, which we will fix.

s-gordon commented 10 years ago

Hmmmm. Some food for thought.

Thanks for the help.