markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0

kmeans speed up #1347

Closed euhruska closed 6 years ago

euhruska commented 6 years ago

Is there a way to speed up kmeans? See: https://github.com/radical-collaboration/extasy-grlsd/issues/66

Using python 2.7.11, pyemma 2.5.4

export PYEMMA_NJOBS=1
export OMP_NUM_THREADS=1 
/opt/xalt/0.7.6/sles11.3/bin/aprun -n 1 -N 1 -L 18544 -d 1 -cc 0 python "run-tica-msm.py"

With ('n atoms', 132), ('n frames total', 4800000), ('n trajs', 900), the kmeans step takes hours:

cl = pyemma.coordinates.cluster_kmeans(data=y, k=msm_states, max_iter=10, stride=msm_stride)

thempel commented 6 years ago

The easiest way to speed up k-means is of course to increase the stride, i.e. to use less data for the clustering. Reducing the dimensionality beforehand should help too, as should using fewer cluster centers.
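To get a feeling for how much each of these knobs buys, here is a rough back-of-the-envelope sketch. It only counts the multiply-adds in the assignment steps and ignores the center initialization, so treat it as an order-of-magnitude estimate and not as a model of PyEMMA's actual runtime. The frame count is taken from the numbers above; the values used for the number of dimensions, the number of centers and the stride are hypothetical placeholders, since `msm_states`, `msm_stride` and the dimension of `y` are not given in this thread.

```python
def kmeans_assignment_cost(n_frames, n_dims, k, max_iter, stride=1):
    """Rough number of multiply-adds spent assigning frames to centers.

    Every iteration compares each retained frame with each of the k centers,
    and each comparison touches n_dims coordinates.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    # Ceiling division: approximate number of frames kept after striding.
    # (Striding is applied per trajectory, so the real count differs slightly.)
    n_used = -(-n_frames // stride)
    return n_used * k * n_dims * max_iter


# n_frames is from the report above; n_dims=10 and k=100 are made-up values.
baseline = kmeans_assignment_cost(4800000, n_dims=10, k=100, max_iter=10)
strided = kmeans_assignment_cost(4800000, n_dims=10, k=100, max_iter=10, stride=10)
print("baseline:", baseline)
print("stride=10:", strided)
print("speed-up factor:", baseline / strided)
```

In this simple model the cost is linear in each of the three quantities, so halving the number of frames, dimensions or centers each roughly halves the work.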

Alternatively, you can try pyemma.coordinates.cluster_mini_batch_kmeans() (http://www.emma-project.org/latest/api/generated/pyemma.coordinates.cluster_mini_batch_kmeans.html) as an approximation to k-means.
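For anyone wondering why mini-batch k-means is cheaper: each iteration only looks at a small random sample of the data and nudges the centers towards it, instead of passing over all frames. Below is a minimal pure-Python sketch of that idea. It is not PyEMMA's implementation; the function names, the `init_centers` argument and the per-center learning rate of 1/count are my own choices for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def mini_batch_kmeans(data, k, batch_size, n_iter, init_centers=None, seed=0):
    """Approximate k-means that updates centers from small random batches.

    data is a list of equally long tuples. If init_centers is None, k random
    data points are used as the starting centers.
    """
    rng = random.Random(seed)
    if init_centers is None:
        centers = [list(p) for p in rng.sample(data, k)]
    else:
        centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)  # points seen so far, per center

    for _ in range(n_iter):
        batch = rng.sample(data, batch_size)
        # Assign the whole batch first, using the centers as they are now.
        assignments = [nearest_center(p, centers) for p in batch]
        # Then move each center towards its points with a shrinking step size.
        for point, idx in zip(batch, assignments):
            counts[idx] += 1
            rate = 1.0 / counts[idx]
            centers[idx] = [(1.0 - rate) * c + rate * x
                            for c, x in zip(centers[idx], point)]
    return centers
```

The work per iteration scales with `batch_size` and no longer with the total number of frames, which is where the speed-up comes from; the price is that the centers are only an approximation. With PyEMMA itself the call should look much like the original one, e.g. `pyemma.coordinates.cluster_mini_batch_kmeans(data=y, k=msm_states, max_iter=10)`; please check the linked API page for the exact arguments in your version.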

But one should also note that huge data sets require some time for the computation.

euhruska commented 6 years ago

Is it possible to parallelize kmeans? I couldn't get it to work.

clonker commented 6 years ago

Since you set OMP_NUM_THREADS and PYEMMA_NJOBS to 1, all parallelization is switched off. The initialization of the centers (probably the time-consuming part) cannot be fully parallelized, only to some extent. The actual iteration is parallelized.
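A small sketch of how the two variables could be raised when launching the script from Python. `parallel_env` is a hypothetical helper, not part of PyEMMA, and the core count of 16 in the comment is just an example. On the batch system the `-d 1 -cc 0` part of the aprun line presumably has to be adjusted as well, so that the process is actually allowed to use more than one core.

```python
import os


def parallel_env(n_cores, base_env=None):
    """Return a copy of the environment with both thread settings raised.

    PYEMMA_NJOBS and OMP_NUM_THREADS are the two variables that were
    set to 1 in the original job script.
    """
    if n_cores < 1:
        raise ValueError("n_cores must be >= 1")
    env = dict(os.environ if base_env is None else base_env)
    env["PYEMMA_NJOBS"] = str(n_cores)
    env["OMP_NUM_THREADS"] = str(n_cores)
    return env


# Possible usage (not executed here, since it needs the actual script):
#   import subprocess
#   subprocess.run(["python", "run-tica-msm.py"], env=parallel_env(16))
```

Exporting the same two variables with a larger value in the job script has the same effect; the helper just keeps both settings in one place.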

euhruska commented 6 years ago

works