markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0
311 stars 119 forks source link

minRMSD clustering with equitemporally-selected generators gives wildly different results depending on number of threads used. #794

Closed jchodera closed 8 years ago

jchodera commented 8 years ago

After getting some weird results using parallel minRMSD clustering with equitemporally selected generators in #759, I modified the script (thanks to @maxentile!) to apply equitemporal minRMSD clustering to an alanine dipeptide dataset. The results appear to be wildly different depending on how many threads I use (selected via OMP_NUM_THREADS). Single-threaded results (plot-1.png) look great, but when 8 threads are used (plot-8.png) I get garbage. plot-1 plot-8

Because selecting generators equitemporally should produce a deterministic set of generators, the subsequent assignment of configurations to generators should be deterministic as well, so there should be no dependence on the number of threads used, nor should the results of clustering be stochastic. The subsequent BHMM sampling of confidence intervals is stochastic, but the MLEs should be deterministic.

My script is here:

#!/usr/bin/env python

import pyemma
import numpy as np
import mdtraj
import time
import os

# Source directory
#source_directory = '/cbio/jclab/projects/fah/fah-data/munged-with-time/no-solvent/11406' # CK2
source_directory = '11406' # CK2

################################################################################
# Load reference topology
################################################################################

print('Creating HDF5 trajectories...')
from msmbuilder.example_datasets import AlanineDipeptide
trajs = AlanineDipeptide().get().trajectories
trajectory_filenames = []
for i,traj in enumerate(trajs):
    fname = 'data/alanine_{0}.h5'.format(i)
    trajectory_filenames.append(fname)
    traj.save_hdf5(fname)

print ('loading reference topology...')
reference_pdb_filename = 'protein.pdb'
traj[0].save_pdb(reference_pdb_filename)

################################################################################
# Initialize featurizer
################################################################################

print('Initializing featurizer...')
import pyemma.coordinates
featurizer = pyemma.coordinates.featurizer(reference_pdb_filename)
featurizer.add_all()

################################################################################
# Define coordinates source
################################################################################

print('Defining coordinates source...')
import pyemma.coordinates
from glob import glob
coordinates_source = pyemma.coordinates.source(trajectory_filenames, features=featurizer)
print("There are %d frames total in %d trajectories." % (coordinates_source.n_frames_total(), coordinates_source.number_of_trajectories()))

################################################################################
# Cluster
################################################################################

print('Clustering...')
generator_ratio = 200
nframes = coordinates_source.n_frames_total()
nstates = int(nframes / generator_ratio)
stride = 1
metric = 'minRMSD'
initial_time = time.time()
clustering = pyemma.coordinates.cluster_uniform_time(data=coordinates_source, k=nstates, stride=stride, metric=metric)
final_time = time.time()
elapsed_time = final_time - initial_time
print('Elapsed time %.3f s' % elapsed_time)

# Save discrete trajectories.
dtrajs = clustering.dtrajs
dtrajs_dir = 'dtrajs'
clustering.save_dtrajs(output_dir=dtrajs_dir, output_format='npy', extension='.npy')

################################################################################
# Make timescale plots
################################################################################

import matplotlib as mpl
mpl.use('Agg') # Don't use display
import matplotlib.pyplot as plt

from pyemma import msm
from pyemma import plots

lags = [1,2,5,10,20,50]
its = msm.its(dtrajs, lags=lags, errors='bayes')
plots.plot_implied_timescales(its)

plt.savefig('plot.pdf')
franknoe commented 8 years ago

At least these two plots seem to do different things. One is the MLE with the full data, which uses a sliding window. The other one looks like it is a Bayesian MSM, so that would compute the effective count matrix for the sampler, and the solid lines would be the MLE given that effective count matrix which is different from the sliding window count.

So back to the issue - can you confirm directly that the results of the clustering itself (i.e. the cluster centers and the assignment) are really different between 1 and 8 threads?

Am 04/05/16 um 19:54 schrieb John Chodera:

After getting some weird results using parallel minRMSD clustering with equitemporally selected generators in #759 https://github.com/markovmodel/PyEMMA/issues/759, I modified the script (thanks to @maxentile https://github.com/maxentile!) to apply equitemporal minRMSD clustering to an alanine dipeptide dataset. The results appear to be wildly different depending on how many threads I use (selected via |OMP_NUM_THREADS|). Single-threaded results (|plot-1.png|) look great, but when 8 threads are used (|plot-8.png|) I get garbage. plot-1 https://cloud.githubusercontent.com/assets/3656088/15023634/3f9a3f2a-11ff-11e6-93c3-dfabb738ac6d.png plot-8 https://cloud.githubusercontent.com/assets/3656088/15023637/42adb336-11ff-11e6-9226-314032e1c0ab.png

Because selecting generators equitemporally /should/ produce a deterministic set of generators, the subsequent assignment of configurations to generators should be deterministic as well, so there should be /no/ dependence on the number of threads used, nor should the results of clustering be stochastic. The subsequent BHMM sampling of confidence intervals is stochastic, but the MLEs should be deterministic.

My script is here:

!/usr/bin/env python

import pyemma import numpyas np import mdtraj import time import os

Source directory

source_directory =

'/cbio/jclab/projects/fah/fah-data/munged-with-time/no-solvent/11406'

CK2

source_directory= '11406' # CK2

################################################################################

Load reference topology

################################################################################

print('Creating HDF5 trajectories...') from msmbuilder.example_datasetsimport AlanineDipeptide trajs= AlanineDipeptide().get().trajectories trajectoryfilenames= [] for i,trajin enumerate(trajs): fname= 'data/alanine{0}.h5'.format(i) trajectory_filenames.append(fname) traj.save_hdf5(fname)

print ('loading reference topology...') reference_pdb_filename= 'protein.pdb' traj[0].save_pdb(reference_pdb_filename)

################################################################################

Initialize featurizer

################################################################################

print('Initializing featurizer...') import pyemma.coordinates featurizer= pyemma.coordinates.featurizer(reference_pdb_filename) featurizer.add_all()

################################################################################

Define coordinates source

################################################################################

print('Defining coordinates source...') import pyemma.coordinates from globimport glob coordinates_source= pyemma.coordinates.source(trajectory_filenames,features=featurizer) print("There are %d frames total in %d trajectories." % (coordinates_source.n_frames_total(), coordinates_source.number_of_trajectories()))

################################################################################

Cluster

################################################################################

print('Clustering...') generator_ratio= 200 nframes= coordinates_source.n_frames_total() nstates= int(nframes/ generator_ratio) stride= 1 metric= 'minRMSD' initial_time= time.time() clustering= pyemma.coordinates.cluster_uniform_time(data=coordinates_source,k=nstates,stride=stride,metric=metric) final_time= time.time() elapsed_time= final_time- initial_time print('Elapsed time %.3f s' % elapsed_time)

Save discrete trajectories.

dtrajs= clustering.dtrajs dtrajs_dir= 'dtrajs' clustering.save_dtrajs(output_dir=dtrajs_dir,output_format='npy',extension='.npy')

################################################################################

Make timescale plots

################################################################################

import matplotlibas mpl mpl.use('Agg')# Don't use display import matplotlib.pyplotas plt

from pyemmaimport msm from pyemmaimport plots

lags= [1,2,5,10,20,50] its= msm.its(dtrajs,lags=lags,errors='bayes') plots.plot_implied_timescales(its)

plt.savefig('plot.pdf')

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/794


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

jchodera commented 8 years ago

There are 10 trajectories for which discrete trajectories are produced. The first (0.npy) is identical:

>>> import numpy as np
>>> s1 = np.load('dtrajs-1/0.npy')
>>> s8 = np.load('dtrajs-1/0.npy')
>>> s1
array([ 78, 131,  78, ...,  15, 153, 257], dtype=int32)
>>> s8
array([ 78, 131,  78, ...,  15, 153, 257], dtype=int32)

but the next trajectory is very different:

>>> s1 = np.load('dtrajs-1/1.npy')
>>> s8 = np.load('dtrajs-8/1.npy')
>>> s1
array([ 78,   1,   1, ..., 157, 381, 412], dtype=int32)
>>> s8
array([1, 0, 1, ..., 1, 1, 3], dtype=int32)

Happy to share these files if they would be of help!

franknoe commented 8 years ago

I see. I guess @marscher needs to look into this. Is it possible that there is just a change of labels because of a different order of listing cluster centers in the multithread version?

Am 04/05/16 um 20:49 schrieb John Chodera:

There are 10 trajectories for which discrete trajectories are produced. The first (|0.npy|) is identical:

|>>> import numpy as np >>> s1 = np.load('dtrajs-1/0.npy') >>> s8 = np.load('dtrajs-1/0.npy') >>> s1 array([ 78, 131, 78, ..., 15, 153, 257], dtype=int32) >>> s8 array([ 78, 131, 78, ..., 15, 153, 257], dtype=int32) |

but the next trajectory is very different:

|>>> s1 = np.load('dtrajs-1/1.npy') >>> s8 = np.load('dtrajs-8/1.npy')

s1 array([ 78, 1, 1, ..., 157, 381, 412], dtype=int32) >>> s8 array([1, 0, 1, ..., 1, 1, 3], dtype=int32) |

Happy to share these files if they would be of help!

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/794#issuecomment-216963520


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

jchodera commented 8 years ago

The input trajectories have the same order, so the filenames (0.npy, 1.npy) should be preserved.

If the cluster labels were exchanged, why would the first trajectory (0.npy) be assigned the correct cluster center IDs but the other trajectories be assigned different IDs?

franknoe commented 8 years ago

Yeah, it seems unlikely. Can you confirm that cluster centers are different too between the two runs or is there just an assignment issue?

jchodera commented 8 years ago

Which attribute or method do I want to call to retrieve the cluster generator snapshot indices or configurations?

http://www.emma-project.org/latest/api/generated/pyemma.coordinates.clustering.UniformTimeClustering.html

franknoe commented 8 years ago

clustercenters

Am 05/05/16 um 14:58 schrieb John Chodera:

Which attribute or method do I want to call to retrieve the cluster generator snapshot indices or configurations?

http://www.emma-project.org/latest/api/generated/pyemma.coordinates.clustering.UniformTimeClustering.html

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/794#issuecomment-217145554


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

jchodera commented 8 years ago

The clustercenters are identical between 1 and 8 threads, but the cluster assignments differ wildly:

[chodera@mskcc-ln1 pyemma]$ python -i analyze.py
Checking cluster centers...
0 / 499 centers differ
Checking cluster assignments...
    1617 /     9999 cluster assignents match
     214 /    10000 cluster assignents match
     202 /    10000 cluster assignents match
     194 /    10000 cluster assignents match
     136 /    10000 cluster assignents match
     123 /    10000 cluster assignents match
      62 /    10000 cluster assignents match
     184 /    10000 cluster assignents match
     152 /    10000 cluster assignents match
     196 /    10000 cluster assignents match

Script:

import cPickle as pickle
import numpy as np
import os

print('Checking cluster centers...')
c1 = pickle.load(open('clustercenters-1.p', 'rb'))
c8 = pickle.load(open('clustercenters-1.p', 'rb'))

THRESHOLD = 1.0e-3
ndiff = 0
for (x,y) in zip(c1,c8):
   dist = np.sqrt(np.sum((x-y)**2))
   if dist > THRESHOLD:
       ndiff += 1
print('%d / %d centers differ' % (ndiff, len(c1)))

print('Checking cluster assignments...')

def load_dtrajs(dirname):
   trajectories = list()
   ntraj = 10
   for trajindex in range(ntraj):
      filename = os.path.join(dirname, '%d.npy' % trajindex) 
      trajectory = np.load(filename)
      trajectories.append(trajectory)
   return trajectories

trajs1 = load_dtrajs('dtrajs-1')
trajs8 = load_dtrajs('dtrajs-8')

for (traj1, traj8) in zip(trajs1, trajs8):
    nmatch = (traj1 == traj8).sum()
    ntot = len(traj1)
    print('%8d / %8d cluster assignents match' % (nmatch, ntot))
franknoe commented 8 years ago

Thank you! So then it's definitely an issue, and the problem occurs during assignment, i.e. it is independent of UniformTimeClustering. Let's see what @marscher says.

Am 05/05/16 um 16:53 schrieb John Chodera:

The |clustercenters| are identical between 1 and 8 threads, but the cluster assignments differ wildly:

|[chodera@mskcc-ln1 pyemma]$ python -i analyze.py Checking cluster centers... 0 / 499 centers differ Checking cluster assignments... 1617 / 9999 cluster assignents match 214 / 10000 cluster assignents match 202 / 10000 cluster assignents match 194 / 10000 cluster assignents match 136 / 10000 cluster assignents match 123 / 10000 cluster assignents match 62 / 10000 cluster assignents match 184 / 10000 cluster assignents match 152 / 10000 cluster assignents match 196 / 10000 cluster assignents match |

Script:

import cPickleas pickle import numpyas np import os

print('Checking cluster centers...') c1= pickle.load(open('clustercenters-1.p','rb')) c8= pickle.load(open('clustercenters-1.p','rb'))

THRESHOLD = 1.0e-3 ndiff= 0 for (x,y)in zip(c1,c8): dist= np.sqrt(np.sum((x-y)**2)) if dist> THRESHOLD: ndiff+= 1 print('%d / %d centers differ' % (ndiff,len(c1)))

print('Checking cluster assignments...')

def load_dtrajs(dirname): trajectories= list() ntraj= 10 for trajindexin range(ntraj): filename= os.path.join(dirname,'%d.npy' % trajindex) trajectory= np.load(filename) trajectories.append(trajectory) return trajectories

trajs1= load_dtrajs('dtrajs-1') trajs8= load_dtrajs('dtrajs-8')

for (traj1, traj8)in zip(trajs1, trajs8): nmatch= (traj1== traj8).sum() ntot= len(traj1) print('%8d / %8d cluster assignents match' % (nmatch, ntot))

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/794#issuecomment-217175353


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

marscher commented 8 years ago

Thank you again for testing/analyzing potential problems for the new parallel assignment code, but unfortunately I can not reproduce what you just observed with a simple test like this:


     def test_assignment_multithread(self):
         # re-do assignment with multiple threads and compare results
         n = 10000
         dim = 100
         chunksize=1000
         X= np.random.random((n, dim))
         centers = X[np.random.choice(n, dim)]

         assignment_mp = coor.assign_to_centers(X, centers, n_jobs=4, chunk_size=chunksize)
         assignment_sp = coor.assign_to_centers(X, centers, n_jobs=1, chunk_size=chunksize)

         np.testing.assert_equal(assignment_mp, assignment_sp)

How many threads/ or how many cores does your testing system have? I will re-run our tests on a system with more cores, but I would assume that if the OMP code is correct, it will reproduce the same results no matter how many threads one uses.

marscher commented 8 years ago

The used chunk size may also be of interest to me

marscher commented 8 years ago

I think it is only related to the minRMSD metric. I will look into this

marscher commented 8 years ago

There are a lot of slightly populated centers in the parallel case. Since the loop is the same for euclidean distance, I assume that there is some kind of interference of the SSE code (mdtrajs theobald algorithm implementation) with OpenMP threads.

serial execution:
[119  51 155  15 143  43  17  30  28 143  59 142  12  18 255  55  43  55
  53  15 444 103  76 234   7  32  41 134  17  22  11  17  40 118  92  71
 119 142  50  61 141   3 120  51  36   0   8  31 634 142 115  24  51  77
 565  58   7 112  21  76 692  57 139  29 287  33  75  96  63 122  95  96
  96 266 167  56  18  40  79  13  19  34  47 130  87  53  46  17 121  78
 197 132 112   9  21  15 273  52 445  39]
4 threads:
[1785 1113 1041  841  840  657  515  456  372  346  283  222  205  166  159
  115   92   90   72   57   67   45   48   31   30   25   23   20   12   16
   12   13    8    6    7    9   12   14    6    5    6    2    3    3    4
    4    1    1   11    3    2    0    4    3    3    1    0    2    0    4
   16    1    5    2    4    0    1    3    1    1    4    0    3    6    8
    1    1    3    4    1    1    1    1    4    1    2    1    1    2    1
    1    3    1    0    0    0    7    1   12    2]

I've already checked that the centers array stays unmodified for every iteration. So at least there is no side effect due to a changing the basis of assignment...

It is really strange.

@rmcgibbo would you please comment on this?

marscher commented 8 years ago

I'll suggest for now to enforce single threaded execution in case of minRMSD metric, otherwise people will just run into the same problems as John. I'll put this in the already opened PR for this issue.

franknoe commented 8 years ago

Agreed. Thanks.

Am 06/05/16 um 14:12 schrieb Martin K. Scherer:

I'll suggest for now to enforce single threaded execution in case of minRMSD metric, otherwise people will just run into the same problems as John. I'll put this in the already opened PR for this issue.

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/794#issuecomment-217424209


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

jchodera commented 8 years ago

Thanks for the detective work here!

@mpharrigan may also have insight into mdtraj thread safety.

jchodera commented 8 years ago

@marscher: Are you sure it isn't just a problem with the OpenMP directives?

On this line, you are copying data into buffer_a and buffer_b in each thread, but you haven't declared those buffers to be thread-private in this line. Shouldn't those buffers be independent for each thread?

marscher commented 8 years ago

This has been fixed in #796

franknoe commented 8 years ago

yay!

Am 09/05/16 um 11:39 schrieb Martin K. Scherer:

Closed #794 https://github.com/markovmodel/PyEMMA/issues/794.

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/794#event-654159444


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany