Closed jchodera closed 8 years ago
I think either @marscher oder @clonker are the experts on this...
Am 05/04/16 um 21:42 schrieb John Chodera:
I'm not sure if this is a silly mistake on my part or an issue with the chunking, but I've run into an odd error in minRMSD clustering of a large dataset:
|[chodera@mskcc-ln1 ~]$ cat cluster-CK2.o7065714 Warning: no access to tty (Bad file descriptor). Thus no job control in this shell. Mon Apr 4 17:10:36 EDT 2016 loading reference topology... Initializing featurizer... There are 988951 frames total in 557 trajectories. Clustering... Traceback (most recent call last): File "cluster.py", line 51, in
dtrajs = clustering.dtrajs File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.0.4-py2.7-linux-x86_64.egg/pyemma/coordinates/clustering/interface.py", line 73, in dtrajs self._dtrajs = self.assign(stride=1) File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.0.4-py2.7-linux-x86_64.egg/pyemma/coordinates/clustering/interface.py", line 180, in assign mapped = self.get_output(stride=stride) File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.0.4-py2.7-linux-x86_64.egg/pyemma/coordinates/transform/transformer.py", line 784, in get_output trajs[itraj][t:t + L, :] = chunk[:, dimensions] ValueError: could not broadcast input array from shape (100,1) into shape (60,1) | Here is the simple clustering script:
|#!/usr/bin/env python import pyemma import numpy as np import mdtraj import os # Source directory source_directory = '/cbio/jclab/projects/fah/fah-data/munged-with-time/no-solvent/11406'
CK2
################################################################################
Load reference topology
################################################################################ print ('loading reference topology...') reference_pdb_filename = 'protein.pdb' reference_trajectory = os.path.join(source_directory, 'run0-clone0.h5') traj = mdtraj.load(reference_trajectory) traj[0].save_pdb(reference_pdb_filename) ################################################################################
Initialize featurizer
################################################################################ print('Initializing featurizer...') import pyemma.coordinates featurizer = pyemma.coordinates.featurizer(reference_pdb_filename) featurizer.add_all() ################################################################################
Define coordinates source
################################################################################ import pyemma.coordinates from glob import glob trajectory_filenames = glob(os.path.join(sourcedirectory, 'run-clone_.h5')) coordinates_source = pyemma.coordinates.source(trajectory_filenames, features=featurizer) print("There are %d frames total in %d trajectories." % (coordinates_source.n_frames_total(), coordinates_source.number_of_trajectories())) ################################################################################
Cluster
################################################################################ print('Clustering...') generator_ratio = 100 nframes = coordinates_source.n_frames_total() nstates = int(nframes / generator_ratio) stride = 4 metric = 'minRMSD' clustering = pyemma.coordinates.cluster_uniform_time(data=coordinates_source, k=nstates, stride=stride, metric=metric) dtrajs = clustering.dtrajs # Save discrete trajectories. dtrajs_dir = 'dtrajs' clustering.save_dtrajs(output_dir=dtrajs_dir, output_format='npy', extension='.npy') |
Unfortunately, the dataset is enormous, but I'm happy to get someone an account on our cluster if that would make things easier.
I'm trying to find a minimum example that demonstrates the same issue.
I'm using |pyemma| 2.0.4
— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/759
Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin
Phone: (+49) (0)30 838 75354 Web: research.franknoe.de
Please first try this again with 2.1 or devel because this test case may be affected by the changes. If still a problem, we'll look into details.
Rerunning with 2.1---thanks!
With pyemma
2.1, here's what I get now:
[chodera@mskcc-ln1 ~]$ cat cluster-CK2.o7066737
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Wed Apr 6 12:58:02 EDT 2016
loading reference topology...
Initializing featurizer...
Obtaining file info: 100% (558/558) [##############################] eta 00:01 -There are 1006751 frames total in 558 trajectories.
Clustering...
Traceback (most recent call last):
File "cluster.py", line 50, in <module>
clustering = pyemma.coordinates.cluster_uniform_time(data=coordinates_source, k=nstates, stride=stride, metric=metric)
File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/api.py", line 1339, in cluster_uniform_time
return _param_stage(data, res, stride=stride)
File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/api.py", line 815, in _param_stage
this_stage.estimate(X=input_stage, stride=stride)
File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/transform/transformer.py", line 189, in estimate
super(StreamingTransformer, self).estimate(X, **kwargs)
File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/_base/estimator.py", line 343, in estimate
self._model = self._estimate(X)
File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/clustering/uniform_time.py", line 85, in _estimate
self.clustercenters = np.concatenate([X for X in it])
File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/data/_base/datasource.py", line 636, in __exit__
self.close()
File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/data/feature_reader.py", line 334, in close
self._logger.debug('closing current trajectory "%s"' % self._data_source.filenames[self._itraj])
IndexError: list index out of range
I realize that (1) this is a lot of data, at 1M frames, and (2) I think the trajectories are probably continuing to grow daily as we sync more data from our FAH servers. I wonder if issue (2) was causing the first issue, since we may have run into a problem where the trajectory is accessed once and its length is noted, but then is longer when we try to open it again.
Rather than having to make a copy of the entire 45G dataset, I wonder if there is a way to ensure that a consistent trajectory length is recorded and used throughout analysis.
Hmm, this time the clustering fails rather than the assignment. I guess if the trajectories are changing while clustering is in progress, that is very likely to lead to problems in various places.
I think it would be good to first see if what you're trying to do works without a moving target. Could you make a static copy of a subset of data and try with that?
Am 06/04/16 um 20:12 schrieb John Chodera:
With |pyemma| 2.1, here's what I get now:
|[chodera@mskcc-ln1 ~]$ cat cluster-CK2.o7066737 Warning: no access to tty (Bad file descriptor). Thus no job control in this shell. Wed Apr 6 12:58:02 EDT 2016 loading reference topology... Initializing featurizer... Obtaining file info: 100% (558/558) [##############################] eta 00:01 -There are 1006751 frames total in 558 trajectories. Clustering... Traceback (most recent call last): File "cluster.py", line 50, in
clustering = pyemma.coordinates.cluster_uniform_time(data=coordinates_source, k=nstates, stride=stride, metric=metric) File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/api.py", line 1339, in cluster_uniform_time return _param_stage(data, res, stride=stride) File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/api.py", line 815, in _param_stage this_stage.estimate(X=input_stage, stride=stride) File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/transform/transformer.py", line 189, in estimate super(StreamingTransformer, self).estimate(X, **kwargs) File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/_base/estimator.py", line 343, in estimate self._model = self._estimate(X) File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/clustering/uniform_time.py", line 85, in _estimate self.clustercenters = np.concatenate([X for X in it]) File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/data/_base/datasource.py", line 636, in exit self.close() File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/data/feature_reader.py", line 334, in close self._logger.debug('closing current trajectory "%s"' % self._data_source.filenames[self._itraj]) IndexError: list index out of range | I realize that (1) this is a lot of data, at 1M frames, and (2) I think the trajectories are probably continuing to grow daily as we sync more data from our FAH servers. I wonder if issue (2) was causing the first issue, since we may have run into a problem where the trajectory is accessed once and its length is noted, but then is /longer/ when we try to open it again.
Rather than having to make a copy of the entire 45G dataset, I wonder if there is a way to ensure that a consistent trajectory length is recorded and used throughout analysis.
— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/759#issuecomment-206498600
Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin
Phone: (+49) (0)30 838 75354 Web: research.franknoe.de
I've copied the whole dataset over to a place where it will not be changing during the run, and have re-run the analysis, getting the same error:
[chodera@mskcc-ln1 ~]$ cat cluster-CK2.o7066909
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Wed Apr 6 17:35:33 EDT 2016
loading reference topology...
Initializing featurizer...
Obtaining file info: 100% (558/558) [##############################] eta 00:01 -There are 1006751 frames total in 558 trajectories.
Clustering...
Traceback (most recent call last):
File "cluster.py", line 51, in <module>
clustering = pyemma.coordinates.cluster_uniform_time(data=coordinates_source, k=nstates, stride=stride, metric=metric)
File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/api.py", line 1339, in cluster_uniform_time
return _param_stage(data, res, stride=stride)
File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/api.py", line 815, in _param_stage
this_stage.estimate(X=input_stage, stride=stride)
File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/transform/transformer.py", line 189, in estimate
super(StreamingTransformer, self).estimate(X, **kwargs)
File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/_base/estimator.py", line 343, in estimate
self._model = self._estimate(X)
File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/clustering/uniform_time.py", line 85, in _estimate
self.clustercenters = np.concatenate([X for X in it])
File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/data/_base/datasource.py", line 636, in __exit__
self.close()
File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/data/feature_reader.py", line 334, in close
self._logger.debug('closing current trajectory "%s"' % self._data_source.filenames[self._itraj])
IndexError: list index out of range
Wed Apr 6 18:09:15 EDT 2016
It could be that some of the files were being appended to during the copy and ended up in an inconsistent state, of course, but wouldn't the failure have occurred earlier in the process if that were the case?
Thanks. I think this is a real bug then. Not sure who's in charge here, but pls look into this guys.
Am 07/04/16 um 01:37 schrieb John Chodera:
I've copied the whole dataset over to a place where it will not be changing during the run, and have re-run the analysis, getting the same error:
|[chodera@mskcc-ln1 ~]$ cat cluster-CK2.o7066909 Warning: no access to tty (Bad file descriptor). Thus no job control in this shell. Wed Apr 6 17:35:33 EDT 2016 loading reference topology... Initializing featurizer... Obtaining file info: 100% (558/558) [##############################] eta 00:01 -There are 1006751 frames total in 558 trajectories. Clustering... Traceback (most recent call last): File "cluster.py", line 51, in
clustering = pyemma.coordinates.cluster_uniform_time(data=coordinates_source, k=nstates, stride=stride, metric=metric) File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/api.py", line 1339, in cluster_uniform_time return _param_stage(data, res, stride=stride) File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/api.py", line 815, in _param_stage this_stage.estimate(X=input_stage, stride=stride) File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/transform/transformer.py", line 189, in estimate super(StreamingTransformer, self).estimate(X, **kwargs) File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/_base/estimator.py", line 343, in estimate self._model = self._estimate(X) File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/clustering/uniform_time.py", line 85, in _estimate self.clustercenters = np.concatenate([X for X in it]) File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/data/_base/datasource.py", line 636, in exit self.close() File "/cbio/jclab/home/chodera/miniconda/lib/python2.7/site-packages/pyEMMA-2.1-py2.7-linux-x86_64.egg/pyemma/coordinates/data/feature_reader.py", line 334, in close self._logger.debug('closing current trajectory "%s"' % self._data_source.filenames[self._itraj]) IndexError: list index out of range Wed Apr 6 18:09:15 EDT 2016 | It could be that some of the files were being appended to /during/ the copy and ended up in an inconsistent state, of course, but wouldn't the failure have occurred earlier in the process if that were the case?
— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/759#issuecomment-206621621
Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin
Phone: (+49) (0)30 838 75354 Web: research.franknoe.de
Thank you John!
Am 07.04.2016 um 05:03 schrieb Frank Noe:
Thanks. I think this is a real bug then. Not sure who's in charge here, but pls look into this guys. It looks like a real bug. The itraj index ran out of bounds during the handling of the log file. I guess one should not use _itraj but current_trajindex in this case.
@clonker: what do you think? Should we simply remove this log-statement or try to fix it?
Since the filenames property gets used not only in the log statement (but also, e.g., in _next_chunk), an actual fix would be preferable I guess.
I just noticed that John is working with hdf5 files. Since we haven't worked with that so far, my guess is that we are running into a situation which is not or poorly tested, maybe a Reader that behaves differently than expected.
John could you give us a small h5 file for testing?
Am 08/04/16 um 10:58 schrieb Moritz Hoffmann:
Since the filenames property gets used not only in the log statement (but also, e.g., in _next_chunk), an actual fix would be preferable I guess.
— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/759#issuecomment-207332898
Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin
Phone: (+49) (0)30 838 75354 Web: research.franknoe.de
Sure, here you go.
I can also give someone cluster access if they want to test on the whole dataset.
thanks
Am 08/04/16 um 15:40 schrieb John Chodera:
Sure, here you go https://choderalab.squarespace.com/s/pyemma-clustering-example.tgz.
I can also give someone cluster access if they want to test on the whole dataset.
— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/759#issuecomment-207435808
Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin
Phone: (+49) (0)30 838 75354 Web: research.franknoe.de
@clonker I dont think the filename propery is the problem here, but the _itraj index which runs out of bounds. Maybe we should think about a setter for _itraj (which also raises StopIteration, in case it ran out of bounds) to ensure it will always be valid.
I guess it is lines 372 and 373 https://github.com/markovmodel/PyEMMA/blob/devel/pyemma/coordinates/data/feature_reader.py#L372
that can make self._itraj==self.number_of_trajectories()
and later makes access to self._data_source.filenames[self._itraj]
fail.
How is _itraj
supposed to work? Should we allow _itraj==number_of_trajectories()
or not? The current code is contradictory.
_itraj
indeed can become == self.number_of_trajectories()
, however the subsequent access to self._data_source_filenames[self._itraj]
happens only if:
_itraj < self.number_of_trajectories() (== self._data_source.ntraj)
.
The mistake is rather that in the case of _itraj >= self.number_of_trajectories()
no StopIteration()
is raised, which could be fixed in a general way like @marscher suggested above.
@jchodera Actually I was not able to reproduce this chunksize mismatch error with the provided test data. Is it still occurring with the devel version?
Testing right now!
I've managed to finish some clustering---it's MUCH faster with parallel clustering!---but the results we're getting from equitemporal generator selection with 1M snapshots and 10K clusters look, well, terrible. I realize equitemporal clustering isn't optimal, but I'm surprised things are so bad.
I'm going to look into clustering some simpler systems (e.g. alanine dipeptide) with the parallel minRMSD clustering just to make sure it recapitulates expected behavior, but I was just wondering what automated testing is done on the minRMSD clustering code so I don't end up repeating tests you already perform via unit testing.
It appears that the results of equitemporal clustering can vary wildly depending on the number of threads used. I'll open a new issue about this.
I'm not sure if this is a silly mistake on my part or an issue with the chunking, but I've run into an odd error in minRMSD clustering of a large dataset:
Here is the simple clustering script:
Unfortunately, the dataset is enormous, but I'm happy to get someone an account on our cluster if that would make things easier.
I'm trying to find a minimum example that demonstrates the same issue.
I'm using
pyemma
2.0.4