msmbuilder / msmbuilder2022

Statistical models for biomolecular dynamics
GNU Lesser General Public License v2.1

The last step in the tutorial is a little different #26

Closed · 117654321 closed this 2 months ago

117654321 commented 3 months ago

I have been diligently working with MSMBuilder, following the tutorial meticulously. However, I have encountered a challenge in the final step: the graph I produced (see Figure 1) differs markedly from the one shown in your example (see Figure 2). Could you please provide some insight into this deviation and some guidance to address it?

[Figure 1: my result. Figure 2: the tutorial's result.]

bojunliu0818 commented 3 months ago

Hi @117654321,

Thank you for contacting us and raising the issue. I have followed the tutorial and reproduced results similar to what you have shared. I am not exactly sure what causes the difference from the result in the tutorial (it appears to come from different clustering settings, e.g., different random seeds or changes in some default hyperparameters), but I believe your result should be correct.

Here is the Python environment I used to conduct the test: Python 3.6, msmbuilder 3.8, mdtraj 1.9.7, numpy 1.19.5, scikit-learn 0.19.0, etc.

117654321 commented 3 months ago

Are the results derived in this manner reasonable and unambiguous? Why are there more microstates on the left, yet their populations are smaller? If the same trajectory can lead to completely different outcomes because of random seeds, is the program reliable?

bojunliu0818 commented 3 months ago

Thank you for raising this good question. In my opinion, the observed randomness can be attributed to two factors:

(1). Minibatch KMeans Clustering: This algorithm randomly samples input data at each iteration to enhance computational efficiency. However, this random sampling can introduce additional randomness, potentially leading to worse performance compared to traditional KMeans clustering.

(2). Number of tICs selected: The clustering is conducted in a 5-dimensional space defined by 5 tICs. This relatively high dimensionality for Minibatch KMeans clustering (only 500 clusters) may introduce randomness and large discretization error.

Therefore, I would suggest trying different clustering algorithms (such as KMeans or KCenters) in a lower-dimensional tICA space (e.g., 3 tICs). You can also plot the implied timescales or the GMRQ score of your MSM to validate the different algorithms or hyperparameters you used to construct the model.
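A minimal sketch of the reproducibility point, assuming scikit-learn's clustering classes are used directly (the tutorial may wrap them through msmbuilder, and the synthetic X5 array here is only a stand-in for frames projected onto 5 tICs): fixing random_state pins down the subsampling inside MiniBatchKMeans, and slicing off the first 3 columns clusters in a lower-dimensional tICA space.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

rng = np.random.default_rng(0)
X5 = rng.normal(size=(10_000, 5))  # stand-in for frames projected onto 5 tICs
X3 = X5[:, :3]                     # keep only the first 3 tICs

# Two runs with the same random_state give identical cluster centers,
# so the MSM built on top of them is reproducible.
a = MiniBatchKMeans(n_clusters=50, random_state=42, n_init=3).fit(X3)
b = MiniBatchKMeans(n_clusters=50, random_state=42, n_init=3).fit(X3)
assert np.allclose(a.cluster_centers_, b.cluster_centers_)

# Full KMeans avoids the per-iteration subsampling of the mini-batch
# variant, trading speed for a lower-variance discretization.
km = KMeans(n_clusters=50, random_state=42, n_init=10).fit(X3)
```

The same random_state idea applies to whichever clustering class the pipeline actually uses.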

117654321 commented 3 months ago

Received, thanks for the guidance

117654321 commented 2 months ago

Hi @bojunliu0818 ,

Thank you very much for your previous reply! I am testing different combinations of the number of tICs, the number of clusters, and the clustering method, as you suggested. In the meantime, I have another question about lag_time.

The lag_time in tica.py is 100, which means a lag time of 100 × 50 ps = 5 ns: the covariance matrix is calculated from frames separated by an interval of 100 frames. However, in the microstate.py file, I am confused about which value lag_time should take. Should lag_time be set to 100, as in tica.py, or to 5, implying 5 ns?

In the class MarkovStateModel, lag_time participates in several calculations. In _transition_counts, lag_time is used to define the starting state i and the ending state j, so I think lag_time should be 100 for the interval. With lag_time set to 100, the actual lag time between state i and state j is 5 ns. If lag_time is set to 5 in microstate.py, then the actual lag time between state i and state j is 5 × 50 ps = 250 ps, which is not consistent with the lag time in tica.py. However, if lag_time is set to 100, then msm.summarize() in microstate.py will not print the correct timescales, because the timescales() method that msm.summarize() uses simply computes timescales = -self.lag_time / np.log(u[1:]); the unit of lag_time is not ns, and lag_time here means 100 frames. That is the contradiction about lag_time that I found in the MSM calculation.

Other calculations in MarkovStateModel involving lag_time are:

  1. In _transition_counts, the counts matrix is calculated by: counts /= float(lag_time)

Is lag_time in this equation supposed to be 100 or 5? I looked up the paper (J. Chem. Phys. 2011, 134, 174105) and could not find this equation; the closest I found is equation (46), which is similar. However, its denominator is N − l, which for fs_peptide would be 10000 − 100 = 9900 (I guess), not 100. So I am not sure what lag_time is supposed to be in the above equation.

  2. In _parse_ergodic_cutoff, it returns 1.0 / self.lag_time. Is lag_time supposed to be 100 or 5?

Sorry for taking your precious time and thank you very much for your help and guidance!

bojunliu0818 commented 2 months ago

Certainly, and thanks for bringing up another good question. Any "time" in this context, whether a lag time or an implied timescale, is expressed in units of one time step (the saving interval between frames) of your trajectories. Therefore, you're absolutely correct to use lag_time=100, which represents 100 time steps (frames), equivalent to 5 ns. Additionally, when you calculate the timescales, msmbuilder outputs them in units of time steps, meaning you'll need to convert them from time steps to ns manually (for example, a timescale of 500 means 500 × 50 ps = 25 ns). Similarly, in equation (46), N represents the number of frames in the trajectory (i.e., the total length of the trajectory in time steps), while l represents the lag time (also in time steps). In _parse_ergodic_cutoff, lag_time should also be 100.
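The frame-based unit convention can be sketched in a few lines of numpy (the eigenvalues here are made up for illustration; the formula mirrors the timescales = -self.lag_time/np.log(u[1:]) line quoted earlier in the thread):

```python
import numpy as np

lag_time = 100  # lag in frames, as in tica.py and microstate.py
dt_ns = 0.05    # saving interval: 50 ps = 0.05 ns per frame

# Hypothetical transition-matrix eigenvalues; u[0] = 1 is the stationary mode.
u = np.array([1.0, 0.9, 0.5])
timescales_frames = -lag_time / np.log(u[1:])  # implied timescales, in frames
timescales_ns = timescales_frames * dt_ns      # manual conversion to ns

# A reported timescale of 500 frames corresponds to 500 * 50 ps = 25 ns.
assert 500 * dt_ns == 25.0
```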

117654321 commented 2 months ago

Thank you so much for your reply! It is now clear to me that lag_time represents a number of frames. I still have a question about the lag times specified in the timescales.py file in the msm folder. They are specified as a list of powers of 2, i.e., [1, 2, 4, 8, 16, 32, 64, 128]. Each lag time in the list is passed to MarkovStateModel to calculate the corresponding implied timescales, and the figure implied-timescales.pdf is plotted from these implied timescales. My question is that none of these lag times is 100, which means they are not consistent with the lag time used in tica.py. Are the timescales calculated with these lag times meaningful? In other words, do the lag times in tica.py, microstate.py, and timescales.py always have to be the same?

I have another question about the eigenvectors calculated in msm.summarize(). Msm.summarize() calls _normalized_eigensystem to calculate and normalize the left eigenvectors (lv) and right eigenvectors (rv). I printed the normalized lv and rv, calculated their norms manually, and found that none of their norms equals 1. I am curious why their norms are not 1.

Again, thank you very much for your precious time and your patient and clear explanation!

bojunliu0818 commented 2 months ago

No problem! And for these two questions:

(1) The lag times used in tICA and the MSM don't need to be the same, because they apply to different models. Typically, when constructing a Markov state model on the discrete states, it's recommended to plot lag time versus implied timescales, as demonstrated in timescales.py. This lets you determine the lag time beyond which the implied timescales don't increase much (ideally they should remain unchanged). This lag time is referred to as the Markovian lag time, indicating that the model is approaching Markovian behavior. At this lag time, you can then construct the Markov state model to predict long-timescale dynamics or estimate kinetic properties.
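The flat-timescale criterion can be illustrated with a toy numpy sketch (the 2-state matrix below is made up). For an exactly Markovian process, the transition matrix at lag k·tau equals the k-th matrix power of the matrix at lag tau, so the implied timescale is constant in the lag time; that plateau is what you look for in implied-timescales.pdf:

```python
import numpy as np

# Made-up 2-state transition probability matrix estimated at the base lag.
T = np.array([[0.95, 0.05],
              [0.10, 0.90]])
tau = 1  # base lag in frames

def implied_timescale(Tm, lag):
    # The second-largest eigenvalue gives the slowest relaxation mode.
    lam = np.sort(np.linalg.eigvals(Tm).real)[-2]
    return -lag / np.log(lam)

# Exactly Markovian: T at lag k*tau is matrix_power(T, k), so the implied
# timescale is flat across the lag times scanned in timescales.py.
its = [implied_timescale(np.linalg.matrix_power(T, k), k * tau)
       for k in [1, 2, 4, 8, 16]]
```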

(2) The term 'normalized' here doesn't refer to the usual L2 norm of a vector, because the inner product in the eigenspace of the transfer operator (or of the transition probability matrix in a discrete state space) is weighted by the stationary distribution. So, when we talk about normalized eigenmodes, using the right eigenvectors $rv$ as an example, it means that $\sum_j rv[j,i] \, rv[j,i] \, \mu[j] = 1$, where $\mu$ is the stationary distribution, i.e., the first left eigenvector $lv[:,0]$ of the TPM.
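This mu-weighted normalization can be checked numerically on a made-up 2-state TPM (a numpy sketch, not msmbuilder's internal _normalized_eigensystem):

```python
import numpy as np

# Hypothetical 2-state transition probability matrix (rows sum to 1).
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Stationary distribution mu: left eigenvector of T with eigenvalue 1.
w, lv = np.linalg.eig(T.T)
mu = np.real(lv[:, np.argmax(np.real(w))])
mu = mu / mu.sum()

# Right eigenvectors, rescaled so that sum_j rv[j, i]^2 * mu[j] = 1.
_, rv = np.linalg.eig(T)
rv = np.real(rv)
for i in range(rv.shape[1]):
    rv[:, i] /= np.sqrt(np.sum(rv[:, i] ** 2 * mu))

# Every mode now has unit mu-weighted norm ...
weighted = [np.sum(rv[:, i] ** 2 * mu) for i in range(rv.shape[1])]
# ... but the plain L2 norms are generally not 1, which is what the
# manual check in the question observed.
l2 = [np.linalg.norm(rv[:, i]) for i in range(rv.shape[1])]
```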

117654321 commented 2 months ago

Thank you for your response.

117654321 commented 2 months ago

Hello @bojunliu0818 , I apologize for disturbing you again. After reviewing the historical issues, I found your email address to be bliu293@wisc.edu. I have sent you a private email; could you please take a look when you have a moment?