markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0

Parallel minRMSD clustering #757

Closed · jchodera closed 8 years ago

jchodera commented 8 years ago

Any plans for parallel minRMSD clustering, either via multiprocessing, OpenMP, or some other parallelization strategy?

franknoe commented 8 years ago

None that I'm aware of. I think minRMSD clustering support in PyEMMA is rudimentary, as all efforts have gone into approaches such as clustering in TICA subspaces, which appears to give much better models than clustering in direct geometrical metrics. However, if anyone is interested in implementing something RMSD-based, we can discuss if and where it would fit in.

I think a suitable strategy for supporting requests like this one, without having to re-implement a long list of methods, would be to expose a sufficiently general coordinate transformation interface such that you could plug in any clustering algorithm that implements a general clustering interface. It would certainly be useful to have access to e.g. sklearn clustering algorithms through the framework. We'll have a developer meeting tomorrow or Wednesday and can discuss this.
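For concreteness, here is a minimal sketch of what plugging an sklearn clustering algorithm into the pipeline could look like by pulling the TICA output into memory. This is not a PyEMMA API, the file names are placeholders, and it assumes the TICA-transformed data fits in memory:

```python
import numpy as np
import pyemma.coordinates as coor
from sklearn.cluster import MiniBatchKMeans

# Featurize and reduce to a TICA subspace (file names are placeholders)
feat = coor.featurizer('topology.pdb')
feat.add_backbone_torsions()
inp = coor.source(['traj1.xtc', 'traj2.xtc'], features=feat)
Y = np.vstack(coor.tica(inp, lag=10).get_output())  # assumes this fits in memory

# Any sklearn clustering estimator can then produce the discrete trajectory
dtraj = MiniBatchKMeans(n_clusters=100).fit_predict(Y)
```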


jchodera commented 8 years ago

None that I'm aware of. I think minRMSD clustering support in PyEMMA is rudimentary, as all efforts have gone into approaches such as clustering in TICA subspaces, which appears to give much better models than clustering in direct geometrical metrics.

Can we really say this is the case if we can't easily compare them?

However, if anyone is interested in implementing something RMSD-based, we can discuss if and where it would fit in.

I'd love to explore this!

I think a suitable strategy for supporting requests like this one, without having to re-implement a long list of methods, would be to expose a sufficiently general coordinate transformation interface such that you could plug in any clustering algorithm that implements a general clustering interface. It would certainly be useful to have access to e.g. sklearn clustering algorithms through the framework. We'll have a developer meeting tomorrow or Wednesday and can discuss this.

That would be awesome.

Another easy thing to do might be to simply support a parallel scheme in assign, similar to the n_jobs argument in timescales_msm.
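For reference, the existing pattern looks like this (a usage sketch; `dtrajs` is assumed to be a list of discrete trajectories from a prior clustering step):

```python
import pyemma.msm as msm

# n_jobs parallelizes the implied-timescales estimation across lag times;
# the proposal is to expose the same knob for cluster assignment.
its = msm.timescales_msm(dtrajs, lags=[1, 2, 5, 10, 20], n_jobs=4)
```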

mpharrigan commented 8 years ago

Can we really say this is the case if we can't easily compare them?

You can use GMRQ cross-validation, which we've found supports @franknoe's claim that RMSD clustering isn't all that great.

nsplattner commented 8 years ago

There is a practical limit to using direct geometrical metrics: in very high-dimensional systems, these metrics will result either in poor resolution of state space (using a large cutoff) or poor statistics for each state (using a small cutoff). Therefore, in general it is desirable to use a subset of coordinates. TICA is supposed to select a meaningful subset, but in practice it's unclear whether this subset is optimal in all cases, and a 'manual' selection of coordinates may be required.

For a general evaluation of datasets, I think it's good to have fast minRMSD routines available to evaluate the sampling of coordinate space.

jchodera commented 8 years ago

You can use GMRQ cross-validation, which we've found supports @franknoe's claim that RMSD clustering isn't all that great.

Definitely want to compare with GMRQ, but we first have to be able to cluster a full dataset! @maxentile can't actually cluster our data with MSMBuilder without it throwing a segfault, so we were hoping to get through a dataset with PyEMMA.

jchodera commented 8 years ago

There is a practical limit to using direct geometrical metrics: in very high-dimensional systems, these metrics will result either in poor resolution of state space (using a large cutoff) or poor statistics for each state (using a small cutoff).

minRMSD-based clustering with a fixed number of generators does not use a specific cutoff, and it can automatically discover the low-dimensional manifold on which the data actually lives, even if it is nonlinear. I think RMSD-based methods still have a lot of life left in them, though they may need weighting to suppress components that relax rapidly (much as TICA suppresses these) in order not to require enormous datasets.

nsplattner commented 8 years ago

Suppressing the rapidly relaxing components in RMSD clustering is an interesting idea. This would result in a combination of minRMSD metrics with an analysis of relaxation times, as done in TICA. TICA in minRMSD space instead of Euclidean space?

jchodera commented 8 years ago

Precisely! @jhprinz has been briefed on this if you want to know more.

franknoe commented 8 years ago

Since I guess we don't want to parallelize each clustering method / metric pair separately, it looks to me like what is needed to speed up minRMSD-based methods is a parallel routine that computes all pairwise minRMSDs between two sets of structures I and J (where I could be the cluster centers and J a chunk of trajectory data). Wouldn't mdtraj be a better place to implement this than PyEMMA?
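The kernel in question would look roughly like this (a sketch, not existing mdtraj or PyEMMA API; `minrmsd_matrix` is a hypothetical name, and the Python loop over centers is exactly what a C/C++ version would parallelize):

```python
import numpy as np
import mdtraj as md

def minrmsd_matrix(centers, chunk):
    """All-pairs minRMSD between k centers (set I) and a chunk of N frames (set J)."""
    D = np.empty((chunk.n_frames, centers.n_frames))
    for j in range(centers.n_frames):
        # md.rmsd computes the RMSD of every frame in `chunk` against
        # frame j of `centers`, using mdtraj's optimized C kernels internally
        D[:, j] = md.rmsd(chunk, centers, frame=j)
    return D
```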


jchodera commented 8 years ago

It seemed like the 'assign' framework was quite general, and the most time-consuming part. I wonder if this could be effectively parallelized in a general way within pyemma by simply parallelizing the processing of chunks. Ideally, individual processes would handle sequential streams of chunks for disk-caching efficiency, but I am not sure how much control there is over chunking.

franknoe commented 8 years ago

I think for a cheap clustering method such as reg-space or k-centers, clustering and assignment are of the same order, O(k·T), where k is the number of clusters and T the number of time steps. For other methods, such as k-medoids, the clustering step can be more expensive. So we would have to speed up both the clustering and the assignment steps in order to get a substantial gain.

However, even within the clustering step you have to do CPU-intensive assignment (is that what you meant?). In the clustering step, at any point where you have a data chunk with N trajectory frames and k clusters, you need to compute N·k distances (e.g. minRMSDs, if that is your metric).

So both the clustering and the a posteriori assignment steps in a discretization scheme would profit from having a low-level parallelization of computing a distance matrix between two sets of structures. I think it would be best to do this at the C or C++ level, which is why I suggested having this close to mdtraj, which I think has a C or C++ implementation of minRMSD. Perhaps there is even some code around that already does this that I'm not aware of.
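As a toy illustration of why the clustering pass itself is dominated by distance evaluations, regular-space clustering reduces to the loop below (a pseudocode-level sketch; `metric` stands for e.g. a minRMSD function, and real implementations batch the inner comparisons into the N×k distance matrix discussed above):

```python
def regspace_cluster(chunks, dmin, metric):
    """Toy regular-space clustering: a frame becomes a new center if it is
    farther than dmin from every existing center, so each frame costs up
    to k distance evaluations, i.e. O(k*T) overall."""
    centers = []
    for chunk in chunks:          # stream over data chunks
        for frame in chunk:
            if all(metric(frame, c) > dmin for c in centers):
                centers.append(frame)
    return centers
```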


jchodera commented 8 years ago

It's actually a bit hard for me to follow the control flow in the pyemma clustering code, but all of the clustering operations go back and forth between a generator/centers update step and an assignment step. If the same assign path is called for all clustering codes, and it already breaks things up by chunks, simply making that part parallel would give you a general parallelization strategy that speeds up the most time-consuming part (cluster assignment).

So both the clustering and the a posteriori assignment steps in a discretization scheme would profit from having a low-level parallelization of computing a distance matrix between two sets of structures. I think it would be best to do this at the C or C++ level, which is why I suggested having this close to mdtraj, which I think has a C or C++ implementation of minRMSD. Perhaps there is even some code around that already does this that I'm not aware of.

That could also be fast but would take much more effort. Since you are already chunking the data through a common assign scheme that then dispatches to the lower-level cluster assignment within the chunks, I think you can get a lot of speedup for little work if you simply allow the option of parallelizing the assign scheme across chunks, ideally having some concept of chunks that come from different coordinate streams.
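At the level being proposed, the chunk-parallel assign could be as simple as the sketch below (hypothetical helper names; `minrmsd_matrix` is the pairwise-distance kernel sketched earlier, and this assumes chunks can be shipped to worker processes):

```python
import numpy as np
from joblib import Parallel, delayed

def assign_chunk(centers, chunk):
    # minrmsd_matrix: the hypothetical pairwise kernel sketched above;
    # (n_frames, n_centers) distances -> index of nearest center per frame
    return minrmsd_matrix(centers, chunk).argmin(axis=1)

def parallel_assign(centers, chunks, n_jobs=4):
    """Assign every frame to its nearest center, parallelized across chunks."""
    parts = Parallel(n_jobs=n_jobs)(
        delayed(assign_chunk)(centers, chunk) for chunk in chunks)
    return np.concatenate(parts)
```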

franknoe commented 8 years ago

OK, we'll discuss this tomorrow. Will let you know.


franknoe commented 8 years ago

OK, we will do that on the PyEMMA level.

jchodera commented 8 years ago

Great, thanks!