markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0

Some difficulties with minRMSD clustering: chunk size shape mismatch #759

Closed jchodera closed 8 years ago

jchodera commented 8 years ago

I'm not sure if this is a silly mistake on my part or an issue with the chunking, but I've run into an odd error in minRMSD clustering of a large dataset:

[chodera@mskcc-ln1 ~]$ cat cluster-CK2.o7065714
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Mon Apr  4 17:10:36 EDT 2016
loading reference topology...
Initializing featurizer...
There are 988951 frames total in 557 trajectories.
Clustering...
Traceback (most recent call last):
  File "cluster.py", line 51, in <module>
    dtrajs = clustering.dtrajs
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.0.4-py2.7-linux-x86_64.egg/pyemma/coordinates/clustering/interface.py", line 73, in dtrajs
    self._dtrajs = self.assign(stride=1)
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.0.4-py2.7-linux-x86_64.egg/pyemma/coordinates/clustering/interface.py", line 180, in assign
    mapped = self.get_output(stride=stride)
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.0.4-py2.7-linux-x86_64.egg/pyemma/coordinates/transform/transformer.py", line 784, in get_output
    trajs[itraj][t:t + L, :] = chunk[:, dimensions]
ValueError: could not broadcast input array from shape (100,1) into shape (60,1)

Here is the simple clustering script:

#!/usr/bin/env python

import pyemma
import numpy as np
import mdtraj
import os

# Source directory
source_directory = '/cbio/jclab/projects/fah/fah-data/munged-with-time/no-solvent/11406' # CK2

################################################################################
# Load reference topology
################################################################################

print ('loading reference topology...')
reference_pdb_filename = 'protein.pdb'
reference_trajectory = os.path.join(source_directory, 'run0-clone0.h5')
traj = mdtraj.load(reference_trajectory)
traj[0].save_pdb(reference_pdb_filename)

################################################################################
# Initialize featurizer
################################################################################

print('Initializing featurizer...')
import pyemma.coordinates
featurizer = pyemma.coordinates.featurizer(reference_pdb_filename)
featurizer.add_all()

################################################################################
# Define coordinates source
################################################################################

import pyemma.coordinates
from glob import glob
trajectory_filenames = glob(os.path.join(source_directory, 'run*-clone*.h5'))
coordinates_source = pyemma.coordinates.source(trajectory_filenames, features=featurizer)
print("There are %d frames total in %d trajectories." % (coordinates_source.n_frames_total(), coordinates_source.number_of_trajectories()))

################################################################################
# Cluster
################################################################################

print('Clustering...')
generator_ratio = 100
nframes = coordinates_source.n_frames_total()
nstates = int(nframes / generator_ratio)
stride = 4
metric = 'minRMSD'
clustering = pyemma.coordinates.cluster_uniform_time(data=coordinates_source, k=nstates, stride=stride, metric=metric)
dtrajs = clustering.dtrajs

# Save discrete trajectories.
dtrajs_dir = 'dtrajs'
clustering.save_dtrajs(output_dir=dtrajs_dir, output_format='npy', extension='.npy')

Unfortunately, the dataset is enormous, but I'm happy to get someone an account on our cluster if that would make things easier.

I'm trying to find a minimum example that demonstrates the same issue.
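Something along these lines is what I have in mind for the minimal example (an untested sketch: the synthetic trajectories, their lengths, and the filenames are made up, and I have not confirmed they actually trigger the error):

#!/usr/bin/env python
# Untested sketch of a possible minimal reproduction: a few tiny synthetic
# trajectories with mismatched lengths, written with mdtraj and clustered
# the same way as in the full script above.

import numpy as np
import mdtraj
import pyemma.coordinates

reference = mdtraj.load('protein.pdb')

trajectory_filenames = []
for i, n_frames in enumerate([100, 100, 60]):  # mismatched lengths are a guess
    xyz = 0.1 * np.random.randn(n_frames, reference.n_atoms, 3).astype(np.float32)
    filename = 'synthetic-%d.h5' % i
    mdtraj.Trajectory(xyz, reference.topology).save_hdf5(filename)
    trajectory_filenames.append(filename)

featurizer = pyemma.coordinates.featurizer('protein.pdb')
featurizer.add_all()
coordinates_source = pyemma.coordinates.source(trajectory_filenames, features=featurizer)

clustering = pyemma.coordinates.cluster_uniform_time(data=coordinates_source, k=5, stride=4, metric='minRMSD')
print(clustering.dtrajs)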

I'm using pyemma 2.0.4

franknoe commented 8 years ago

I think either @marscher or @clonker are the experts on this...

franknoe commented 8 years ago

Please first try this again with 2.1 or devel, because this test case may be affected by the changes. If it is still a problem, we'll look into the details.

jchodera commented 8 years ago

Rerunning with 2.1---thanks!

jchodera commented 8 years ago

With pyemma 2.1, here's what I get now:

[chodera@mskcc-ln1 ~]$ cat cluster-CK2.o7066737 
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Wed Apr  6 12:58:02 EDT 2016
loading reference topology...
Initializing featurizer...
Obtaining file info: 100% (558/558) [##############################] eta 00:01 -There are 1006751 frames total in 558 trajectories.
Clustering...
Traceback (most recent call last):
  File "cluster.py", line 50, in <module>
    clustering = pyemma.coordinates.cluster_uniform_time(data=coordinates_source, k=nstates, stride=stride, metric=metric)
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/api.py", line 1339, in cluster_uniform_time
    return _param_stage(data, res, stride=stride)
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/api.py", line 815, in _param_stage
    this_stage.estimate(X=input_stage, stride=stride)
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/transform/transformer.py", line 189, in estimate
    super(StreamingTransformer, self).estimate(X, **kwargs)
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/_base/estimator.py", line 343, in estimate
    self._model = self._estimate(X)
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/clustering/uniform_time.py", line 85, in _estimate
    self.clustercenters = np.concatenate([X for X in it])
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/data/_base/datasource.py", line 636, in __exit__
    self.close()
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/data/feature_reader.py", line 334, in close
    self._logger.debug('closing current trajectory "%s"' % self._data_source.filenames[self._itraj])
IndexError: list index out of range

I realize that (1) this is a lot of data, at 1M frames, and (2) I think the trajectories are probably continuing to grow daily as we sync more data from our FAH servers. I wonder if issue (2) was causing the first issue, since we may have run into a problem where the trajectory is accessed once and its length is noted, but then is longer when we try to open it again.

Rather than having to make a copy of the entire 45G dataset, I wonder if there is a way to ensure that a consistent trajectory length is recorded and used throughout analysis.
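In the meantime, the best I can do from my side is probably to record the per-trajectory frame counts once and re-check them before later passes (untested sketch below; 'frame_counts.npy' is just a hypothetical manifest file, and counting frames by loading each file is slow for 45G):

# Untested sketch: record per-trajectory frame counts once, and re-check them
# before later analysis passes to detect trajectories that have grown.
# 'frame_counts.npy' is a hypothetical manifest filename; newly added
# trajectories are ignored by the comparison below.
import os
from glob import glob
import numpy as np
import mdtraj

source_directory = '/cbio/jclab/projects/fah/fah-data/munged-with-time/no-solvent/11406'  # CK2
trajectory_filenames = sorted(glob(os.path.join(source_directory, 'run*-clone*.h5')))

manifest_filename = 'frame_counts.npy'
frame_counts = np.array([mdtraj.load(filename).n_frames for filename in trajectory_filenames])

if os.path.exists(manifest_filename):
    recorded = np.load(manifest_filename)
    changed = [fn for fn, old, new in zip(trajectory_filenames, recorded, frame_counts) if old != new]
    print('%d trajectories changed length since the manifest was written.' % len(changed))
else:
    np.save(manifest_filename, frame_counts)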

franknoe commented 8 years ago

Hmm, this time the clustering fails rather than the assignment. I guess if the trajectories are changing while clustering is in progress, that is very likely to lead to problems in various places.

I think it would be good to first see if what you're trying to do works without a moving target. Could you make a static copy of a subset of data and try with that?

jchodera commented 8 years ago

I've copied the whole dataset over to a place where it will not be changing during the run, and have re-run the analysis, getting the same error:

[chodera@mskcc-ln1 ~]$ cat cluster-CK2.o7066909 
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Wed Apr  6 17:35:33 EDT 2016
loading reference topology...
Initializing featurizer...
Obtaining file info: 100% (558/558) [##############################] eta 00:01 -There are 1006751 frames total in 558 trajectories.
Clustering...
Traceback (most recent call last):
  File "cluster.py", line 51, in <module>
    clustering = pyemma.coordinates.cluster_uniform_time(data=coordinates_source, k=nstates, stride=stride, metric=metric)
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/api.py", line 1339, in cluster_uniform_time
    return _param_stage(data, res, stride=stride)
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/api.py", line 815, in _param_stage
    this_stage.estimate(X=input_stage, stride=stride)
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/transform/transformer.py", line 189, in estimate
    super(StreamingTransformer, self).estimate(X, **kwargs)
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/_base/estimator.py", line 343, in estimate
    self._model = self._estimate(X)
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/clustering/uniform_time.py", line 85, in _estimate
    self.clustercenters = np.concatenate([X for X in it])
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/data/_base/datasource.py", line 636, in __exit__
    self.close()
  File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/data/feature_reader.py", line 334, in close
    self._logger.debug('closing current trajectory "%s"' % self._data_source.filenames[self._itraj])
IndexError: list index out of range
Wed Apr  6 18:09:15 EDT 2016

It could be that some of the files were being appended to during the copy and ended up in an inconsistent state, of course, but wouldn't the failure have occurred earlier in the process if that were the case?
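To help rule that out, I can run a quick per-file check over the static copy and confirm that every trajectory opens cleanly (untested sketch; the copy location below is a placeholder):

# Untested sketch: confirm every .h5 file in the static copy loads cleanly and
# report its frame count, to rule out files left inconsistent by the copy.
# 'copied_directory' is a placeholder for wherever the static copy lives.
import os
from glob import glob
import mdtraj

copied_directory = '/path/to/static/copy/11406'
for filename in sorted(glob(os.path.join(copied_directory, 'run*-clone*.h5'))):
    try:
        traj = mdtraj.load(filename)
        print('%s: %d frames' % (filename, traj.n_frames))
    except Exception as e:
        print('%s: FAILED to load (%s)' % (filename, e))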

franknoe commented 8 years ago

Thanks. I think this is a real bug then. Not sure who's in charge here, but pls look into this guys.

marscher commented 8 years ago

Thank you John!

It looks like a real bug: the _itraj index ran out of bounds during the handling of the log file. I guess one should not use _itraj but current_trajindex in this case.

@clonker: what do you think? Should we simply remove this log-statement or try to fix it?

clonker commented 8 years ago

Since the filenames property gets used not only in the log statement but also elsewhere (e.g., in _next_chunk), an actual fix would be preferable, I guess.

franknoe commented 8 years ago

I just noticed that John is working with HDF5 files. Since we haven't worked with those so far, my guess is that we are running into a situation that is untested or only poorly tested, maybe a Reader that behaves differently than expected.

John, could you give us a small h5 file for testing?

jchodera commented 8 years ago

Sure, here you go: https://choderalab.squarespace.com/s/pyemma-clustering-example.tgz

I can also give someone cluster access if they want to test on the whole dataset.

franknoe commented 8 years ago

thanks

marscher commented 8 years ago

@clonker I don't think the filenames property is the problem here, but the _itraj index, which runs out of bounds. Maybe we should think about a setter for _itraj (which also raises StopIteration in case it runs out of bounds) to ensure it is always valid.

fabian-paul commented 8 years ago

I guess it is lines 372 and 373 (https://github.com/markovmodel/PyEMMA/blob/devel/pyemma/coordinates/data/feature_reader.py#L372) that can make self._itraj == self.number_of_trajectories() and later cause the access to self._data_source.filenames[self._itraj] to fail. How is _itraj supposed to work? Should we allow _itraj == number_of_trajectories() or not? The current code is contradictory.

clonker commented 8 years ago

_itraj can indeed become == self.number_of_trajectories(); however, the subsequent access to self._data_source.filenames[self._itraj] happens only if:

_itraj < self.number_of_trajectories() (== self._data_source.ntraj).

The mistake is rather that in the case of _itraj >= self.number_of_trajectories() no StopIteration() is raised, which could be fixed in a general way like @marscher suggested above.
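Roughly along these lines (a simplified standalone sketch, not the actual PyEMMA iterator code; the class and attribute names are only illustrative):

class _IteratorSketch(object):
    """Simplified illustration of the proposed guard: a property setter for the
    trajectory index that raises StopIteration as soon as it moves past the
    last trajectory, so a stale index can never be used for filename lookups."""

    def __init__(self, filenames):
        self._filenames = filenames
        self._itraj = 0

    @property
    def _itraj(self):
        return self.__itraj

    @_itraj.setter
    def _itraj(self, value):
        # Reject any index past the end of the filename list.
        if value >= len(self._filenames):
            raise StopIteration()
        self.__itraj = value

    def close(self):
        # With the guarded setter, _itraj always points at a valid filename here.
        print('closing current trajectory "%s"' % self._filenames[self._itraj])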

marscher commented 8 years ago

@jchodera Actually I was not able to reproduce this chunksize mismatch error with the provided test data. Is it still occurring with the devel version?

jchodera commented 8 years ago

Testing right now!

jchodera commented 8 years ago

I've managed to finish some clustering---it's MUCH faster with parallel clustering!---but the results we're getting from equitemporal generator selection with 1M snapshots and 10K clusters look, well, terrible. I realize equitemporal clustering isn't optimal, but I'm surprised things are so bad.

I'm going to look into clustering some simpler systems (e.g. alanine dipeptide) with the parallel minRMSD clustering just to make sure it recapitulates expected behavior, but I was just wondering what automated testing is done on the minRMSD clustering code so I don't end up repeating tests you already perform via unit testing.

[plot of clustering results attached]
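For the sanity check, something like the following is what I have in mind (untested sketch; the alanine dipeptide filenames are placeholders for whatever small test system I end up using):

# Untested sketch of the sanity check: minRMSD uniform-time clustering on a
# small, well-understood system, then a quick look at the discrete trajectories.
# 'alanine-dipeptide.pdb' and 'alanine-dipeptide.h5' are placeholder filenames.
import numpy as np
import pyemma.coordinates

featurizer = pyemma.coordinates.featurizer('alanine-dipeptide.pdb')
featurizer.add_all()
coordinates_source = pyemma.coordinates.source(['alanine-dipeptide.h5'], features=featurizer)

clustering = pyemma.coordinates.cluster_uniform_time(data=coordinates_source, k=50, metric='minRMSD')
dtrajs = clustering.dtrajs

# Every frame should be assigned, and most of the 50 states should be populated.
assert all((dtraj >= 0).all() for dtraj in dtrajs)
print('populated states: %d / 50' % len(np.unique(np.concatenate(dtrajs))))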

jchodera commented 8 years ago

It appears that the results of equitemporal clustering can vary wildly depending on the number of threads used. I'll open a new issue about this.