None that I'm aware of. I think minRMSD clustering support in PyEMMA is rudimentary, as all efforts have gone into approaches such as clustering in TICA subspaces, which appear to give much better models than clustering in direct geometrical metrics. However, if anyone is interested in implementing something RMSD-based, we can discuss if and where it would fit in.
I think a suitable strategy for supporting requests like this one, without having to re-implement a large list of methods, would be to expose a sufficiently general coordinate transformation interface such that you could plug in any clustering algorithm which implements a general clustering interface. It would certainly be useful to have access to e.g. sklearn clustering algorithms through the framework. We'll have a developer meeting tomorrow or Wednesday and can discuss this.
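For illustration, a bridge like this could be as thin as the following sketch. This is a hypothetical adapter, not an existing PyEMMA API: it wraps any sklearn-style clusterer behind a minimal fit/assign interface, so the surrounding pipeline never needs to know which algorithm is underneath.

```python
# Hypothetical sketch of a generic clustering adapter -- not an existing
# PyEMMA API. It wraps any sklearn-style estimator (fit/predict) behind
# a minimal interface the rest of a pipeline could rely on.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

class SklearnClusteringAdapter:
    def __init__(self, estimator):
        self.estimator = estimator  # any object with fit() and predict()

    def fit(self, X):
        self.estimator.fit(X)
        return self

    def assign(self, X):
        # Map each frame (row of X) to the index of its nearest cluster.
        return self.estimator.predict(X)

# Usage: cluster (e.g. TICA-projected) coordinates with an sklearn algorithm.
Y = np.random.randn(1000, 4)  # stand-in for projected trajectory data
adapter = SklearnClusteringAdapter(MiniBatchKMeans(n_clusters=50, n_init=3))
dtraj = adapter.fit(Y).assign(Y)  # discrete trajectory of cluster labels
```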
None that I'm aware of. I think minRMSD clustering support in PyEMMA is rudimentary, as all efforts have gone into approaches such as clustering in TICA subspaces, which appear to give much better models than clustering in direct geometrical metrics.
Can we really say this is the case if we can't easily compare them?
However, if anyone is interested in implementing something RMSD-based, we can discuss if and where it would fit in.
I'd love to explore this!
I think a suitable strategy for supporting requests like this one, without having to re-implement a large list of methods, would be to expose a sufficiently general coordinate transformation interface such that you could plug in any clustering algorithm which implements a general clustering interface. It would certainly be useful to have access to e.g. sklearn clustering algorithms through the framework. We'll have a developer meeting tomorrow or Wednesday and can discuss this.
That would be awesome.
Another easy thing to do might be to simply support a parallel scheme in assign, similar to the n_jobs argument in timescales_msm.
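Something like this, by analogy with timescales_msm. The n_jobs argument to assign() is the proposal here, not a released feature:

```python
# Sketch of the proposed API. cluster_kmeans and assign() exist in
# PyEMMA; the n_jobs keyword on assign() is hypothetical.
import numpy as np
import pyemma.coordinates as coor

data = np.random.randn(10000, 4).astype(np.float32)  # stand-in feature trajectory
cl = coor.cluster_kmeans(data, k=100)  # fit cluster centers
dtrajs = cl.assign(data)               # current behavior: serial assignment
# proposed: dtrajs = cl.assign(data, n_jobs=4)  # parallel over chunks
```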
Can we really say this is the case if we can't easily compare them?
You can use GMRQ cross-validation, which we've found supports @franknoe's claim that RMSD clustering isn't all that great.
There is a practical limit to using direct geometrical metrics: in very high-dimensional systems, these metrics will result either in poor resolution of state space (using a large cutoff) or poor statistics for each state (using a small cutoff). Therefore it is generally desirable to use a subset of coordinates. TICA is supposed to select a meaningful subset, but in practice it's unclear whether this subset is optimal in all cases, and a 'manual' selection of coordinates may be required.
For a general evaluation of datasets I think it's good to have fast minRMSD routines available to evaluate the sampling of coordinate space.
You can use GMRQ cross-validation, which we've found supports @franknoe's claim that RMSD clustering isn't all that great.
Definitely want to compare with GMRQ, but we have to actually be able to cluster a full dataset to compare! @maxentile can't actually cluster our data with MSMBuilder without it throwing a segfault, so we were hoping to get through a dataset with pyemma.
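For reference, the comparison loop is short. A minimal sketch of GMRQ cross-validation, assuming MSMBuilder's MarkovStateModel.score(), which implements the GMRQ of McGibbon & Pande (2015); the shuffle-split over trajectories and all parameter values are illustrative:

```python
# Minimal GMRQ cross-validation sketch, assuming MSMBuilder's
# MarkovStateModel.score() (the GMRQ of McGibbon & Pande, 2015).
# 'dtrajs' is a list of discrete trajectories from some clustering.
import numpy as np
from sklearn.model_selection import ShuffleSplit
from msmbuilder.msm import MarkovStateModel

def gmrq_cv(dtrajs, lag_time=50, n_timescales=5, n_splits=5):
    scores = []
    cv = ShuffleSplit(n_splits=n_splits, test_size=0.5)
    for train_idx, test_idx in cv.split(dtrajs):
        msm = MarkovStateModel(lag_time=lag_time, n_timescales=n_timescales)
        msm.fit([dtrajs[i] for i in train_idx])  # estimate on training half
        # NB: test trajectories visiting states unseen in training need care
        scores.append(msm.score([dtrajs[i] for i in test_idx]))
    return np.mean(scores), np.std(scores)
```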
There is a practical limit to using direct geometrical metrics: in very high-dimensional systems, these metrics will result either in poor resolution of state space (using a large cutoff) or poor statistics for each state (using a small cutoff).
minRMSD-based clustering with a fixed number of generators does not use a specific cutoff, and can automatically discover the low-dimensional manifold on which the data actually lives, even if it is nonlinear. I think RMSD-based methods still have a lot of life left in them, though they may need weighting to suppress components that relax rapidly (much as TICA suppresses these) in order to not require enormous datasets.
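For concreteness, the fixed-number-of-generators scheme is just the Gonzalez farthest-point (k-centers) algorithm under the minRMSD metric; a rough sketch using mdtraj's rmsd kernel:

```python
# Sketch of minRMSD k-centers (Gonzalez farthest-point algorithm): no
# distance cutoff is needed, and the centers adapt to whatever manifold
# the data occupies. 'traj' is an mdtraj.Trajectory.
import numpy as np
import mdtraj as md

def kcenters_minrmsd(traj, k):
    center_ids = [0]                   # seed with the first frame
    d = md.rmsd(traj, traj, frame=0)   # minRMSD of every frame to it
    for _ in range(1, k):
        new = int(d.argmax())          # farthest frame becomes the next center
        center_ids.append(new)
        d = np.minimum(d, md.rmsd(traj, traj, frame=new))
    return center_ids, d               # centers + distance to nearest center
```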
Suppressing the rapidly relaxing components in RMSD clustering is an interesting idea. This would result in a combination of minRMSD metrics with an analysis of relaxation times as done in TICA. TICA in minRMSD space instead of Euclidean space?
Precisely! @jhprinz has been briefed on this if you want to know more
Since I guess we don't want to parallelize each clustering method / metric pair separately, it looks to me like what is needed to speed minRMSD-based methods up is a parallel version that computes all pairs of minRMSDs between sets of structures I and J (where I could be the cluster centers and J a chunk of trajectory data). Wouldn't mdtraj be a better place to implement this than PyEMMA?
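mdtraj's md.rmsd (the OpenMP-parallel Theobald QCP kernel) already computes one row of such a matrix per call, so a full I x J block is just a loop over the smaller set. A sketch, assuming 'centers' and 'chunk' are mdtraj Trajectory objects:

```python
# Sketch: all-pairs minRMSD between cluster centers I and a trajectory
# chunk J using mdtraj's OpenMP-parallel QCP kernel. Each md.rmsd call
# computes one row (all frames of 'chunk' vs. one center).
import numpy as np
import mdtraj as md

def rmsd_matrix(centers, chunk):
    """Return D with D[i, j] = minRMSD(centers[i], chunk[j])."""
    D = np.empty((centers.n_frames, chunk.n_frames), dtype=np.float32)
    for i in range(centers.n_frames):
        D[i] = md.rmsd(chunk, centers, frame=i, parallel=True)
    return D

# Assigning each chunk frame to its nearest center is then a column argmin:
# assignments = rmsd_matrix(centers, chunk).argmin(axis=0)
```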
It seemed like the 'assign' framework was quite general, and the most time-consuming part. I wonder if this could be effectively parallelized in a general way within pyemma by simply parallelizing the processing of chunks. Ideally, individual processes would handle sequential streams of chunks for disk caching efficiency, but I am not sure how much control there is over chunking.
I think for a cheap clustering method such as reg-space or k-centers, clustering and assignment are of the same order, O(k*T), where k is the number of clusters and T the number of time steps. For other methods, such as k-medoids, the clustering step can be more expensive. So we would have to speed up both the clustering and the assignment step in order to get a substantial gain.
However, even within the clustering step you have to do CPU-intensive assignment (is that what you meant?). In the clustering step, at any time when you have a data chunk with N trajectory frames and k clusters, you need to compute N*k distances (e.g. minRMSD if that is your metric).
So both the clustering and the a posteriori assignment steps in a discretization scheme would profit from having a low-level parallelization for computing a distance matrix between two sets of structures. I think it would be best to do this at the C or C++ level, which is why I suggested having this close to mdtraj, which I think has a C or C++ implementation of minRMSD. Perhaps there is even some code around doing this that I'm not aware of.
It's actually a bit hard for me to follow the control flow in the pyemma clustering code, but all of the clustering operations go back and forth between a generator/centers update step and an assignment step. If the same assign path is called for all clustering codes, which already breaks things up by chunks, simply making that part parallel will give you a general parallelization strategy that speeds up the most time-consuming part (cluster assignment).
So both the clustering and the a posteriori assignment steps in a discretization scheme would profit from having a low-level parallelization for computing a distance matrix between two sets of structures. I think it would be best to do this at the C or C++ level, which is why I suggested having this close to mdtraj, which I think has a C or C++ implementation of minRMSD. Perhaps there is even some code around doing this that I'm not aware of.
That could also be fast but would take much more effort. Since you are already chunking the data through a common assign scheme that then dispatches to the lower-level cluster assignment within the chunks, I think you can get a lot of speedup for little work if you simply allow the option of parallelizing the assign scheme across chunks, ideally having some concept of chunks that come from different coordinate streams.
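Roughly this pattern. A self-contained sketch of chunk-level parallel assignment, using a Euclidean metric for brevity (a minRMSD kernel such as the rmsd_matrix sketch above would slot in the same way); all names here are illustrative:

```python
# Sketch of chunk-level parallel assignment: each worker computes the
# N*k distance block for one chunk and returns that chunk's labels.
import numpy as np
from functools import partial
from multiprocessing import Pool

def assign_chunk(chunk, centers):
    # (N, k) distance matrix for this chunk, then nearest-center labels
    d = np.linalg.norm(chunk[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

def parallel_assign(chunks, centers, n_jobs=4):
    with Pool(n_jobs) as pool:
        return pool.map(partial(assign_chunk, centers=centers), chunks)

# Example: 10 chunks of 1000 frames in 4 dimensions, 50 centers
chunks = [np.random.randn(1000, 4) for _ in range(10)]
centers = np.random.randn(50, 4)
dtraj_pieces = parallel_assign(chunks, centers)
```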
OK, we'll discuss this tomorrow. Will let you know.
OK, we will do that on the PyEMMA level.
Great, thanks!
Any plans for parallel minRMSD clustering, either via multiprocessing, OpenMP, or some other parallelization strategy?