markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0
307 stars 118 forks source link

mfpt questions #1443

Closed Sizhang92190 closed 4 years ago

Sizhang92190 commented 4 years ago

Hi,

I'm using tram estimator from PyEMMA (2.5.6) and computing mfpt with the msmobj which obtained from tram calculations. I found some problems and hope you can help me figure out.

A.) Here is an msmobj I got from tram calculation. array(ThermoMSM(P=array([[9.99347e-01, 3.06864e-09, ..., 2.63313e-08, 1.40791e-04], [1.53822e-03, 7.30838e-01, ..., 0.00000e+00, 0.00000e+00], ..., [8.14649e-03, 0.00000e+00, ..., 7.59904e-01, 2.15520e-03], [6.94215e-01, 0.00000e+00, ..., 3.43485e-05, 2.90880e-01]]), active_set=array([ 0, 19, 26, 41, 48, 50, 60, 64, 65, 73, 76, 77, 78, 91], dtype=int32), dt_model=[TimeUnit 200.0 step], neig=14, nstates_full=97, pi=array([9.96825e-01, 0.00000e+00, ..., 0.00000e+00, 7.69390e-07]), reversible=True), dtype=object)

When I want compute the mfpt from state [19, 26] to state [0], I use this command msm.mfpt([19, 26], [0]). I got an error: AssertionError: Chosen set contains states that are not included in the active set. So that means, I should use msm.mfpt([1, 2], [0]). But I got an another error: RuntimeWarning: invalid value encountered in true_divide muX = nuX / np.sum(nuX). Do you know what's the problem for it?

B.) Next, I 'd like use the same msmobj to do tpt calculation. I used the below commands: from pyemma import msm
msm.tpt(mm, [2, 3], [0]) ### here, mm is the msmobj from tram result I can't do it because there's an error: ValueError: operands could not be broadcast together with shapes (97,1) (14,14). I guess the reason is the shape of mm.P and mm.active_set are not equal the shape of mm.pi. Do you have some suggestions?

C.) Last, I found two ways to compute mfpt, are they same things? 1.) from pyemma import msm
msm.tpt(mm, A, B) ### here, mm is the msmobj from tram result 2.) mm.mfpt(A, B) ### here, mm is the msmobj from tram result

When A and B are both one state, the results (from a test) are 114531.30349866912 (from (1)) and 114531.30440262843 (from (2)). They're not exactly same? When A are a few states, the results are 741.4762666303764 (from (1)) and 15980.518823519338 (from (2)). They're so different?

Thanks for any help!!

fabian-paul commented 4 years ago

Hi,

there seem to be something weird going on in (B). You would mind sending the msm/memm objects so that I can have a closer look? I have to admit that dealing with the different active sets of the MSMs is a bit inconvenient, especially when it comes to memms which can have different active/connected set depending on the thermodynamics state.

One part of the solution to (A) is to figure out the actual state indices of states 0, 19, 26 in the returned msm object. And then to use the actual indices in the mfpt and tpt fuction calls. The actual indices can be found by searching the active set:

np.where(np.in1d(mm.active_set, [19, 26]))
np.where(np.in1d(mm.active_set, [0]))

The results in (C) should be the same no matter if you use mfpt or tpt.mfpt. If they are not, you found a bug. Before searching for a bug, it would be good to rule out any error in the data. Are you using (1) and (2) with the exact same MSM? Also note that mfpts to transitions states are not well-defined in the context of discrete time MSMs. Do you observe any systematic trend like only mftps to transition states being inconsistent across the two methods?

Best, Fabian

Sizhang92190 commented 4 years ago

Hi Fabian,

Thanks for your reply. Here're two exampled msm models obtained from tram calculations. test_file.zip ----model_1 is for testing my questions A and B. ----model_2 is for testing my questions C. -- About question A: As you suggested, I tried to compute mfpt (from states [19,26] to state [0]) with the actual indices of states 0, 19, 26, using your codes. I still got this error: RuntimeWarning: invalid value encountered in true_divide muX = nuX / np.sum(nuX). I really suspect that it's using the populations of state 1 and 2 instead of using the pops of states 19 and 26. But I can't check it, since mm.pi in this model has 97 populations, neither corresponding to the shape of mm.active_set nor corresponding to the number of full states, 100. You could check it with the model_1 I provide.

-- About question B: If I still use model_1 to do tpt calculation (no matter which states I used), I would get this error: ValueError: operands could not be broadcast together with shapes (97, 1) (14, 14).

-- About question C: I may know why I got so big different results between these two ways calculations in that example.(Maybe due to the transition states issue) But there's still some difference between them. If I have a "good" model (mm.P.shape = mm.active_set = mm.pi.shape = number of full states = 100), I could do some tpt calculations. Model_2 is one of this kinds of models. mm.mfpt([1,2], [0]) = 338061.99678365665 msm.tpt(mm, [1,2],[0]).mfpt = 338216.5122777814 (state 0 is a fully folded state and state 2 & 3 are fully unfolded states.) They're similar, not same. Hopefully, you can help me find some answer with my models. Thanks!

Best regards Si

fabian-paul commented 4 years ago

Hello Si,

I agree, there is something broken in PyEMMA. mm.stationary_distribution and mm.stationary_distribution_full_state seem to be implemented incorrectly. I wonder how we missed that ... I think this is the origin of all problems A, B, C.

For the moment, you can fall back to msmtools to do the analysis, while I try to fix the bug.

import msmtools
msmtools.analysis.mfpt(mm.P, np.where(np.in1d(mm.active_set,[19, 26]))[0], np.where(np.in1d(mm.active_set,[0]))[0])

Fabian

fabian-paul commented 4 years ago

Can you please check out the fix_memm_stat_dist branch from my fork and retry? https://github.com/fabian-paul/PyEMMA/tree/fix_memm_stat_dist

Sizhang92190 commented 4 years ago

Hi Fabian,

Thanks for you reply. To be honest, I'm not very sure If I'm doing a right way to test these fixed scripts. 1.) I downloaded this fix_memm_stat_dist folder. 2.) Then, I used the below commands to do tram calculations. from PyEMMA_fix_memm_stat_dist.pyemma.thermo import tram tram_obj = tram(ttraj, dtraj, btraj, lagtime, unbiased_state=0) If it's right, unfortunately, I still got the same results as before fixed. (mm.P.shape == mm.active_set.shape != mm.pi.shape).

fabian-paul commented 4 years ago

You will have to install the new version on top of you existing installation. cd into fix_memm_stat_dist and then run python setup.py develop You will also have to rerun TRAM because the fix will not change the interpretation of your npy files.

Sizhang92190 commented 4 years ago

Hi Fabian,

OK. Thanks. I'll try it asap. One more question is about the active set. I can get two active sets: one from tramobj.active_set and another one from mm.active_set. Sometimes, they're different. Which one should I trust? Thanks.

Best, Si

fabian-paul commented 4 years ago

Both active sets are good. The one form the tramobj gives you the set for which TRAM gives you free energy information (the stationary distribution is defined). The active set of the respective msms gives you the set of states for which you get kinetic information. Because TRAM does not reweight kinetic information directly, the states in the active set essentially reflect which states were seen by the unbiased MD simulations.

Sizhang92190 commented 4 years ago

Hi Fabian,

Thanks for your answers again. I installed the fixed version of pyemma basing on your suggestions. I didn't see any change. Maybe I did wrong. So I'm wondering if you could do a test with my input files. I prepared dtraj, btraj and ttraj and also provide a script. Then you can directly test it with your pyemma. (It'll take ~25 min.) test_inputfiles_for_tram.zip

Best, Si

fabian-paul commented 4 years ago

Hi Si, sorry, I forgot to mention that you have to check out the "fix_memm_stat_dist" branch. The workflow is the following:

git clone git@github.com:fabian-paul/PyEMMA.git
cd PyEMMA
git checkout fix_memm_stat_dist
python setup.py develop
Sizhang92190 commented 4 years ago

Hi Fabian,

I still didn't see any change on the msmobj which is generated from tram esitmator. And about computing mfpt with your below command, I'm think if we need multiply with the lagtime.

""""import msmtools msmtools.analysis.mfpt(mm.P, np.where(np.in1d(mm.active_set,[19, 26]))[0], np.where(np.in1d(mm.active_set,[0]))[0])""""

Best, Si

yunhuige commented 4 years ago

Hi Fabian,

Please allow me to chime in here. As @Sizhang92190 mentioned, we found this issue in pyemma and we believe it is caused by the "index" of states when you append a list of MSM class to the TRAM object. By digging into the code, we kind of found the functions to compute MFPT in pyemma and did the following to fix the issue:

First, we import the function for computing MFPT:

from msmtools.analysis.dense.mean_first_passage_time import mfpt, mfpt_between_sets

Then, we define the unbiased MSM from TRAM is tram_msm and let's say the active set from TRAM is defined as act_tram and the active set of the unbiased MSM (one of the models from TRAM) is defined as tram_msm_act. Note that these two active sets may be different (len(act_tram) > len(tram_msm_act) is True at least in our cases so far). Then we define the population estimated by TRAM as old_mu.

Now we need to prepare the population for states that included in the tram_msm_act:

mu = []
for state in tram_msm.active_set:
        ind = np.where(act_tram==state)[0][0]
        mu.append(old_mu[ind])
mu = np.array(mu)

Then we can compute the MFPT as following (assuming state 0 is the sink and state 17 is the source state, the lagtime we used is 1.0 ns):

1.0*mfpt_between_sets(tram_msm.P,np.where(tram_msm_act==0)[0][0],np.where(tram_msm_act==17)[0][0],mu=mu))

This is the way I think could fix the issue. But there should be a more direct way (e.g. fix the line in TRAM estimator when appending MSMs to the list with the conformational state population based on their own active set instead of the global TRAM active set). What do you think?

fabian-paul commented 4 years ago

Hi Yunhui, hi Si,

let me answer your questions/comments one by one.

Hope that helps, Fabian

yunhuige commented 4 years ago

Hi Fabian,

The old_mu actually has the same dimension as the act_tram. For example, you tried to cluster into 100 states and in the TRAM active set you may only have 98 states left. Then the population from TRAM has 98 states population. So I think stick to the active set and make sure your population information correspond to the active set states are important. In this case I think using tram.stationary_distribution_full_state may return 100 states which will not correspond to the active set. So when I tried to prepare the population for the subset (e.g. tram_msm_act) I didn't use tram.stationary_distribution_full_state.

fabian-paul commented 4 years ago

Ah, of course you are correct to use tram.stationary_distribution in conjunction with act_tram. And to use tram_msm.stationary_distribution together with tram_msm_act. Forget what I wrote above.

Also note that this was broken before the fix. tram.stationary_distribution was incompatible with tram.active_set before the fix, and tram_msm.stationary_distribution was incompatible with tram_msm.active_set so please check you results after installing the fix. I apologize for the inconvenience. With the fixed version, you simply use tram_msm.stationary_distribution for the mfpt computation.

yunhuige commented 4 years ago

Hi Fabian,

You are right about the tram_msm.stationary_distribution and tram_msm.active_set. But I believe the tram.stationary_distribution is compatible with tram.active_set before the fix, at least in my cases.