At least these two plots seem to show different things. One is the MLE with the full data, which uses a sliding-window count. The other looks like a Bayesian MSM, which computes an effective count matrix for the sampler; the solid lines would then be the MLE given that effective count matrix, which differs from the sliding-window count.
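(For context, a minimal sketch of the two estimates in PyEMMA terms, assuming the discrete trajectories produced by the script below and an arbitrary lag time; the Bayesian estimator works from an effective count matrix, so its point estimate can differ from the plain sliding-window MLE.)

import pyemma.msm as msm

def compare_estimates(dtrajs, lag=10):
    # dtrajs: list of integer arrays, e.g. clustering.dtrajs from the script below
    mle = msm.estimate_markov_model(dtrajs, lag)    # MLE from sliding-window counts
    bayes = msm.bayesian_markov_model(dtrajs, lag)  # sampler built around an effective count matrix
    return mle.timescales()[:3], bayes.timescales()[:3]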
So, back to the issue: can you confirm directly that the results of the clustering itself (i.e., the cluster centers and the assignments) are really different between 1 and 8 threads?
On 04/05/16 at 19:54, John Chodera wrote:
After getting some weird results using parallel minRMSD clustering with equitemporally selected generators in #759 (https://github.com/markovmodel/PyEMMA/issues/759), I modified the script (thanks to @maxentile, https://github.com/maxentile!) to apply equitemporal minRMSD clustering to an alanine dipeptide dataset. The results appear to be wildly different depending on how many threads I use (selected via OMP_NUM_THREADS). Single-threaded results (plot-1.png) look great, but when 8 threads are used (plot-8.png) I get garbage.
plot-1: https://cloud.githubusercontent.com/assets/3656088/15023634/3f9a3f2a-11ff-11e6-93c3-dfabb738ac6d.png
plot-8: https://cloud.githubusercontent.com/assets/3656088/15023637/42adb336-11ff-11e6-9226-314032e1c0ab.png
Because selecting generators equitemporally should produce a deterministic set of generators, the subsequent assignment of configurations to generators should be deterministic as well, so there should be no dependence on the number of threads used, nor should the results of clustering be stochastic. The subsequent BHMM sampling of confidence intervals is stochastic, but the MLEs should be deterministic.
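(For reference, the check implied here, as a sketch: run the script below twice, once with OMP_NUM_THREADS=1 and once with OMP_NUM_THREADS=8, saving the discrete trajectories to, say, dtrajs-1/ and dtrajs-8/, then compare them. The directory names are hypothetical but match the ones used later in this thread.)

import numpy as np

# Deterministic clustering and assignment imply every discrete trajectory is
# bitwise identical regardless of the thread count.
for i in range(10):
    s1 = np.load('dtrajs-1/%d.npy' % i)
    s8 = np.load('dtrajs-8/%d.npy' % i)
    print('%d.npy identical: %s' % (i, np.array_equal(s1, s8)))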
My script is here:
#!/usr/bin/env python

import pyemma
import numpy as np
import mdtraj
import time
import os

# Source directory
#source_directory = '/cbio/jclab/projects/fah/fah-data/munged-with-time/no-solvent/11406' # CK2
#source_directory = '11406' # CK2

################################################################################
# Load reference topology
################################################################################

print('Creating HDF5 trajectories...')
from msmbuilder.example_datasets import AlanineDipeptide
trajs = AlanineDipeptide().get().trajectories
trajectory_filenames = []
for i, traj in enumerate(trajs):
    fname = 'data/alanine{0}.h5'.format(i)
    trajectory_filenames.append(fname)
    traj.save_hdf5(fname)

print('loading reference topology...')
reference_pdb_filename = 'protein.pdb'
traj[0].save_pdb(reference_pdb_filename)

################################################################################
# Initialize featurizer
################################################################################

print('Initializing featurizer...')
import pyemma.coordinates
featurizer = pyemma.coordinates.featurizer(reference_pdb_filename)
featurizer.add_all()

################################################################################
# Define coordinates source
################################################################################

print('Defining coordinates source...')
import pyemma.coordinates
from glob import glob
coordinates_source = pyemma.coordinates.source(trajectory_filenames, features=featurizer)
print("There are %d frames total in %d trajectories." % (coordinates_source.n_frames_total(), coordinates_source.number_of_trajectories()))

################################################################################
# Cluster
################################################################################

print('Clustering...')
generator_ratio = 200
nframes = coordinates_source.n_frames_total()
nstates = int(nframes / generator_ratio)
stride = 1
metric = 'minRMSD'
initial_time = time.time()
clustering = pyemma.coordinates.cluster_uniform_time(data=coordinates_source, k=nstates, stride=stride, metric=metric)
final_time = time.time()
elapsed_time = final_time - initial_time
print('Elapsed time %.3f s' % elapsed_time)

# Save discrete trajectories.
dtrajs = clustering.dtrajs
dtrajs_dir = 'dtrajs'
clustering.save_dtrajs(output_dir=dtrajs_dir, output_format='npy', extension='.npy')

################################################################################
# Make timescale plots
################################################################################

import matplotlib as mpl
mpl.use('Agg')  # Don't use display
import matplotlib.pyplot as plt

from pyemma import msm
from pyemma import plots

lags = [1, 2, 5, 10, 20, 50]
its = msm.its(dtrajs, lags=lags, errors='bayes')
plots.plot_implied_timescales(its)

plt.savefig('plot.pdf')
There are 10 trajectories for which discrete trajectories are produced. The first (0.npy) is identical:
>>> import numpy as np
>>> s1 = np.load('dtrajs-1/0.npy')
>>> s8 = np.load('dtrajs-1/0.npy')
>>> s1
array([ 78, 131, 78, ..., 15, 153, 257], dtype=int32)
>>> s8
array([ 78, 131, 78, ..., 15, 153, 257], dtype=int32)
but the next trajectory is very different:
>>> s1 = np.load('dtrajs-1/1.npy')
>>> s8 = np.load('dtrajs-8/1.npy')
>>> s1
array([ 78, 1, 1, ..., 157, 381, 412], dtype=int32)
>>> s8
array([1, 0, 1, ..., 1, 1, 3], dtype=int32)
Happy to share these files if they would be of help!
I see. I guess @marscher needs to look into this. Is it possible that there is just a change of labels because of a different order of listing cluster centers in the multithreaded version?
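(One way to test the relabeling hypothesis, sketched here against the dtrajs-1/ and dtrajs-8/ files above: if only the labels changed, each single-threaded label would map to exactly one multithreaded label, and different labels would map to different labels.)

import numpy as np
from collections import defaultdict

s1 = np.load('dtrajs-1/1.npy')
s8 = np.load('dtrajs-8/1.npy')

# Observed label mapping from the single-threaded to the multithreaded run.
mapping = defaultdict(set)
for a, b in zip(s1, s8):
    mapping[a].add(b)

is_function = all(len(targets) == 1 for targets in mapping.values())
images = [next(iter(targets)) for targets in mapping.values()]
is_injective = len(images) == len(set(images))
print('Consistent with a pure relabeling:', is_function and is_injective)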
The input trajectories have the same order, so the filenames (0.npy, 1.npy) should be preserved.
If the cluster labels were exchanged, why would the first trajectory (0.npy) be assigned the correct cluster center IDs but the other trajectories be assigned different IDs?
Yeah, it seems unlikely. Can you confirm that cluster centers are different too between the two runs or is there just an assignment issue?
Which attribute or method do I want to call to retrieve the cluster generator snapshot indices or configurations?
clustercenters
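(A sketch of how the centers from each run might be saved for comparison, assuming the clustering object from the script above; the pickle filenames mirror the ones loaded in the analysis script below.)

import pickle

# Run once with OMP_NUM_THREADS=1 and once with OMP_NUM_THREADS=8, adjusting the
# filename (clustercenters-1.p / clustercenters-8.p) accordingly.
with open('clustercenters-1.p', 'wb') as f:
    pickle.dump(clustering.clustercenters, f)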
The clustercenters are identical between 1 and 8 threads, but the cluster assignments differ wildly:
[chodera@mskcc-ln1 pyemma]$ python -i analyze.py
Checking cluster centers...
0 / 499 centers differ
Checking cluster assignments...
1617 / 9999 cluster assignents match
214 / 10000 cluster assignents match
202 / 10000 cluster assignents match
194 / 10000 cluster assignents match
136 / 10000 cluster assignents match
123 / 10000 cluster assignents match
62 / 10000 cluster assignents match
184 / 10000 cluster assignents match
152 / 10000 cluster assignents match
196 / 10000 cluster assignents match
Script:
import cPickle as pickle
import numpy as np
import os
print('Checking cluster centers...')
c1 = pickle.load(open('clustercenters-1.p', 'rb'))
c8 = pickle.load(open('clustercenters-1.p', 'rb'))
THRESHOLD = 1.0e-3
ndiff = 0
for (x, y) in zip(c1, c8):
    dist = np.sqrt(np.sum((x - y)**2))
    if dist > THRESHOLD:
        ndiff += 1
print('%d / %d centers differ' % (ndiff, len(c1)))
print('Checking cluster assignments...')
def load_dtrajs(dirname):
    trajectories = list()
    ntraj = 10
    for trajindex in range(ntraj):
        filename = os.path.join(dirname, '%d.npy' % trajindex)
        trajectory = np.load(filename)
        trajectories.append(trajectory)
    return trajectories
trajs1 = load_dtrajs('dtrajs-1')
trajs8 = load_dtrajs('dtrajs-8')
for (traj1, traj8) in zip(trajs1, trajs8):
    nmatch = (traj1 == traj8).sum()
    ntot = len(traj1)
    print('%8d / %8d cluster assignents match' % (nmatch, ntot))
Thank you! So then it's definitely an issue, and the problem occurs during assignment, i.e. it is independent of UniformTimeClustering. Let's see what @marscher says.
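(A minimal way to isolate the assignment step, as a sketch; it assumes that pyemma.coordinates.assign_to_centers accepts the metric and n_jobs options used elsewhere in this thread, and that clustering and coordinates_source come from the script above.)

import numpy as np
import pyemma.coordinates as coor

centers = clustering.clustercenters
X = coordinates_source.get_output()  # list of (n_frames, n_features) arrays

# Re-assign the same frames to the same fixed centers with 1 and 8 workers;
# a correct implementation should produce identical discrete trajectories.
d1 = coor.assign_to_centers(X, centers, metric='minRMSD', n_jobs=1)
d8 = coor.assign_to_centers(X, centers, metric='minRMSD', n_jobs=8)
print(all(np.array_equal(a, b) for a, b in zip(d1, d8)))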
Thank you again for testing and analyzing potential problems with the new parallel assignment code, but unfortunately I cannot reproduce what you observed with a simple test like this:
def test_assignment_multithread(self):
    # re-do assignment with multiple threads and compare results
    n = 10000
    dim = 100
    chunksize = 1000
    X = np.random.random((n, dim))
    centers = X[np.random.choice(n, dim)]
    assignment_mp = coor.assign_to_centers(X, centers, n_jobs=4, chunk_size=chunksize)
    assignment_sp = coor.assign_to_centers(X, centers, n_jobs=1, chunk_size=chunksize)
    np.testing.assert_equal(assignment_mp, assignment_sp)
How many threads or cores does your testing system have? I will re-run our tests on a system with more cores, but I would assume that if the OMP code is correct, it will reproduce the same results no matter how many threads are used.
The chunk size you used would also be of interest to me.
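(A quick way to report those details, as a sketch:)

import multiprocessing
import os

print('logical cores:', multiprocessing.cpu_count())
print('OMP_NUM_THREADS:', os.environ.get('OMP_NUM_THREADS', 'not set'))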
I think it is only related to the minRMSD metric. I will look into this.
There are a lot of sparsely populated centers in the parallel case. Since the loop is the same as for the Euclidean distance, I assume there is some kind of interference between the SSE code (mdtraj's Theobald algorithm implementation) and the OpenMP threads.
serial execution:
[119 51 155 15 143 43 17 30 28 143 59 142 12 18 255 55 43 55
53 15 444 103 76 234 7 32 41 134 17 22 11 17 40 118 92 71
119 142 50 61 141 3 120 51 36 0 8 31 634 142 115 24 51 77
565 58 7 112 21 76 692 57 139 29 287 33 75 96 63 122 95 96
96 266 167 56 18 40 79 13 19 34 47 130 87 53 46 17 121 78
197 132 112 9 21 15 273 52 445 39]
4 threads:
[1785 1113 1041 841 840 657 515 456 372 346 283 222 205 166 159
115 92 90 72 57 67 45 48 31 30 25 23 20 12 16
12 13 8 6 7 9 12 14 6 5 6 2 3 3 4
4 1 1 11 3 2 0 4 3 3 1 0 2 0 4
16 1 5 2 4 0 1 3 1 1 4 0 3 6 8
1 1 3 4 1 1 1 1 4 1 2 1 1 2 1
1 3 1 0 0 0 7 1 12 2]
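(For reference, per-center populations like the lists above can be computed from any assignment result; a sketch with hypothetical argument names:)

import numpy as np

def center_populations(assignments, n_centers):
    # Count how many frames each cluster center received.
    return np.bincount(np.concatenate(assignments), minlength=n_centers)

# e.g. compare center_populations(assignment_sp, 100) and
# center_populations(assignment_mp, 100) for the test above.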
I've already checked that the centers array stays unmodified for every iteration, so at least there is no side effect from the basis of assignment changing...
It is really strange.
@rmcgibbo would you please comment on this?
I suggest that, for now, we enforce single-threaded execution for the minRMSD metric; otherwise people will keep running into the same problem John hit. I'll put this in the already opened PR for this issue.
Agreed. Thanks.
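(In the meantime, a user-side workaround, sketched here: force single-threaded assignment by setting OMP_NUM_THREADS before PyEMMA's OpenMP code is loaded, most safely on the command line, e.g. OMP_NUM_THREADS=1 python script.py.)

import os

# Must be set before the extension modules that use OpenMP are imported.
os.environ['OMP_NUM_THREADS'] = '1'

import pyemma.coordinates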
Thanks for the detective work here!
@mpharrigan may also have insight into mdtraj thread safety.
This has been fixed in #796
yay!