torchmd / torchmd-protein-thermodynamics

Tutorials and data necessary to reproduce the results of publication Machine Learning Coarse-Grained Potentials of Protein Thermodynamics

How to choose parameter when plotting free energy surfaces? #16

Open yusowa0716 opened 8 months ago

yusowa0716 commented 8 months ago

Hello,

Thank you for the awesome work provided.

After reading both the paper and the tutorial, I noticed that the tutorial covers only a single protein. Constructing the Free Energy Surface (FES) requires choosing several parameters: the TICA lag time, the TICA dimensionality, the number of k-means clusters, the MSM lag time, and the number of MSM macrostates. For the reference MD simulations, these parameters are indicated in the filenames. For the coarse-grained MD simulations, however, the tutorial only supplies details for protein G.

It would be immensely helpful to have a table listing these parameters for all 12 proteins, for both the REFERENCE and CG MD simulations. The 'skip' values used in the CG MD would also be of great help. If possible, I would also appreciate any guidance or experience you can share on choosing these parameters for new proteins or simulations.

Thank you once again for your time and the valuable resources you have provided. I look forward to any assistance you can offer on this matter.

Warm regards, XJTUNR

AdriaPerezCulubret commented 8 months ago

Hi!

For the reference MSMs, you can use a TICA lag time of 20 steps for all models and project onto the first 3 TICA dimensions. The number of k-means clusters varies between 600 and 1200 depending on the system, but you should get similar results with any number of clusters in that range. The MSM lag time and the number of macrostates also depend on the system.

For the CG models, we used more or less the same hyperparameters for all models. The TICA lag is the same as for the reference, since we are using the same covariances. We projected onto 3 TICA dimensions, clustered into 200 k-means clusters, and used an MSM lag time of 0.01 ns and between 3 and 5 macrostates. The only exception, I believe, is chignolin, where we used a lag time of 0.001 ns. Here's the exact command: cgmodel.markovModel(0.01, 5, units='ns')
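As a rough sketch, the CG guidelines above could be collected in a small Python helper. The names `CGParams` and `cg_params` are hypothetical and not part of the repository. The values are only the general ones quoted in this reply, and the chignolin exception is stated from memory there, so treat it as unconfirmed.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CGParams:
    """General CG hyperparameters quoted in this thread (not per-protein exact values)."""
    tica_lag_steps: int = 20                    # same TICA lag as the reference models
    tica_dims: int = 3                          # number of projected TICA dimensions
    n_clusters: int = 200                       # k-means clusters
    msm_lag_ns: float = 0.01                    # MSM lag time in ns
    macrostate_range: Tuple[int, int] = (3, 5)  # between 3 and 5, system dependent


def cg_params(protein: str) -> CGParams:
    """Return the general CG guidelines for a protein name (hypothetical helper)."""
    # Chignolin is the only exception mentioned: a shorter MSM lag time.
    if protein.strip().lower() == "chignolin":
        return CGParams(msm_lag_ns=0.001)
    return CGParams()


params = cg_params("protein G")
# These are the values that would be passed to a call such as
# cgmodel.markovModel(params.msm_lag_ns, n_macrostates, units='ns'),
# with n_macrostates chosen within params.macrostate_range.
print(params)
```

The number of macrostates is kept as a range because the exact per-protein values are not given here.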

The CG models where we skipped frames are lambda-repressor and protein G, where we skipped every 2 frames.
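One way to read the frame skipping is as a stride of 2 over the trajectory frames, i.e. keeping every second frame. That reading is an assumption, as are the names `SKIP_BY_PROTEIN` and `subsample_frames` in this minimal NumPy sketch; the tutorial code may apply the skip differently.

```python
import numpy as np

# Skip values mentioned in this thread; any other protein is assumed to use no skipping.
SKIP_BY_PROTEIN = {"lambda-repressor": 2, "protein G": 2}


def subsample_frames(frames: np.ndarray, protein: str) -> np.ndarray:
    """Keep every `skip`-th frame along the first axis (assumed meaning of 'skip')."""
    skip = SKIP_BY_PROTEIN.get(protein, 1)
    return frames[::skip]


# Toy trajectory: 10 frames, 4 beads, 3 coordinates each.
traj = np.zeros((10, 4, 3))
print(subsample_frames(traj, "protein G").shape)   # 5 frames kept
print(subsample_frames(traj, "chignolin").shape)   # all 10 frames kept
```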

If I can find time I'll upload all the exact values in the repo both for reference and CG, but these are the general guidelines.

yusowa0716 commented 8 months ago

Thanks for your general guidelines.

I'm looking forward to the exact values for each protein.