Closed kexul closed 2 years ago
Hi @kexul ,
@JenkeScheen 's dataset somd_run_1_i.csv used in the BSS tutorial is work in progress and the protocol can be optimised for production runs.
To bring down the hystereis it is advisable to:
carry out more replicates (we find x3 or x5 is advisable)
vary the number of lambda windows used for each perturbation. Cresset has developed an adaptive sampling procedure to help select a good number of windows per perturbation that you may find interesting. Alternatively use heuristics to assign more windows to 'noisy' perturbatiobs
optimise the mappings between pairs of compounds. The default settings to generate a mapping between pairs of molecules with BioSimSpace may not be optimal for all compound pairs. For Tyk2 we have some evidence that some mappings perform more poorly than others. For instance for ejm31 --> ejm46 a mapping that has the amide -[CH3] of ejm31 duplicated to overlap with an entire cyclopropyl group presented in ejm46 gives very noisy results. Much more precise results with low hysteresis are obtained if the carbon atom of the perturbed -[CH3] group in ejm31 is mapped onto the -CH(C) group of the cyclopropyl ring in ejm46.
avoid building networks that favor edges that deviate systematically from experiment (this is more related to accuracy than precision but not entirely disconnected as inprecise FEP estimates tend to agree more poorly with experimental data).
Generally considering these 4 points can help improve significantly performance. I hope that's helpful.
Thanks for your valuable suggestions, much appreciated!
No problem, feel free to post an update if you want to discuss data comparing different optimisations.
No problem, feel free to post an update if you want to discuss data comparing different optimisations.
Glad to hear that, I'll post my findings, with the hope that the broader community would benefit from our conversations here. Also, thanks for your precious time! @jmichel80
Actually, in my daily practice, I've followed the 4 points you suggested earlier: Normally I use 3 replicates, 32 lambda windows, avoid ring-breaking transformation, and use predefined edges from literature.
Unfortunately, large hysteresis still exists, maybe there are other critical points we've missed?
Here is the result when trying to reproduce the Tyk2 result from the literature, I've completed two runs, and the third run is still running (some of the columns are blank because of the simulation failure):
The average hysteresis is 0.96. Three of them have the largest hysteresis:
ejm42~ejm54
: 3.12ejm42~ejm55
: 3.04ejm54~ejm55
: 2.21It's weird that these are just simple transformations: no cyclical group is involved and the number of atoms perturbated are relatively small.
Here are the molecules, with pIC50 value as the edge (experiment measured value in backet):
Here are my input and output files: github.zip
Though we could still draw a decent figure from the above result when converted G to G and compared to expriment result, the hysteresis make one worried about the reliability of the result.
Just to check, are you using the same mapping for both directions, i.e. with it inverted for the perturbation in one direction? From discussions elsewhere, I did hear talk of certain situations where mappings weren't symmetric, so if you were considering a pipeline where you generated a separate mapping for both directions (A-B and B-A) then you could be creating a different perturbation.
The fact that the results between replicates for ejm42~ejm54 and ejm54~ejm42 are consistent suggest there is a systematic discrepancy between the mappings used.
If you look at the MBAR free energy changes in the outputs folder you have in kcal.mol
perturbation | Bound | Solvated |
---|---|---|
ejm42~ejm54 | -47.1 | -46.2 |
ejm54~ejm42 | 53.3 | 50.8 |
Normally the free energy changes would be of opposite sign. We occasionally see hysteresis in the bound leg, but typically the solvated leg figures are very consistent for a small perturbation like this one.
There is more evidence that the models differ by inspecting the TI gradients at lambda 0 and 1.
perturbation | dG/dl, solvated, l=0 | dG/dL, solvated, l=1 |
---|---|---|
ejm42~ejm54 | -45.15 | -41.19 |
ejm54~ejm42 | 50.3 | 47.1 |
If the mapping is completely symmetric the gradients should have opposite sign and same magnitude at l=0 and l=1.
Could you share the input files given to SOMD for these calculations ? Check whether the MORPH.pert input files describe the same perturbation. This could be related to the comment @lohedges just posted.
It would also be easy enough for us to check single-point energies for both mappings (both directions) to make sure that they are the same. We have some tests to do this already, i.e. checking that the single-point energies for the two extracted lambda = 0/1 end states agree with the original molecules, but these aren't exhaustive, i.e. only for a few ligand pairs.
Thanks @jmichel80 and @lohedges for the comments, I just checked the mapping, and here is the result:
perturbation | hysteresis | same mapping ? |
---|---|---|
ejm42~ejm54 | 3.1216865 | FALSE |
ejm42~ejm55 | 3.0451565 | FALSE |
ejm54~ejm55 | 2.2105375 | TRUE |
ejm46~jmc30 | 1.553418 | FALSE |
ejm45~jul01 | 1.227037 | TRUE |
jmc28~jmc30 | 1.112306 | FALSE |
jmc27~jmc28 | 1.1054535 | TRUE |
ejm46~jmc28 | 1.0416595 | FALSE |
jmc23~jmc27 | 0.979139 | TRUE |
jmc23~jmc28 | 0.977455 | FALSE |
ejm42~ejm50 | 0.8677685 | FALSE |
ejm47~jul01 | 0.8585965 | TRUE |
ejm31~ejm50 | 0.7911985 | TRUE |
ejm49~jul01 | 0.609498 | TRUE |
ejm31~jul01 | 0.5632835 | TRUE |
ejm42~ejm43 | 0.4374285 | FALSE |
jmc28~jul01 | 0.41609 | TRUE |
ejm48~jul01 | 0.376608 | TRUE |
ejm42~ejm44 | 0.216705 | FALSE |
ejm43~ejm44 | 0.1529505 | FALSE |
ejm46~jul01 | 0.1476935 | TRUE |
ejm31~ejm45 | 0.1444715 | TRUE |
ejm31~ejm42 | 0.0122185 | TRUE |
For the example (ejm42~ejm54) I provided, the only difference in the mapping is that in the forward direction, atom 26
was mapped to atom 28
, but in the backward direction atom 28
was mapped to atom 25
:
I'm quite surprised that the small difference in the mapping had such a huge impact.
Here are the input files given to SOMD.
Will check the single point energy later. Will enforce the mapping and run the simulation again. Thanks again for your advice, learned a lot from that.
Thanks for this, I'll try to take a more detailed look later. Given the differences in the mappings I think this is something that we need to warn the user about in our documentation. It may also be possible to do some tricks to enforce an identical mapping, e.g. perhaps finding the MCS in both directions in matchAtoms
, scoring both, then returning the one with the highest score. That way you'll get the same mapping either way. (Perhaps there is an easier way to do this.)
I'm not sure if the mappings are different at the MCS stage, or when being scored. For example, I could imagine that the RMSD alignment might give slightly different results in mapping A to B, rather than B to A.
For reference, the unit test here could be adapted to check that forward/reverse merges for a ligand pair give the same energetics.
These two mappings SHOULD be completely equivalent as atoms 25/26 are symmetry related, i.e. 25/26 should have exactly the same parameters.
But there is something wrong with the input files.
In somd.pert for 42-54 there are 16 perturbed atoms. In somd.pert for 54-42 there are 15 perturbed atoms.
Notable in 42-54 I see entries for perturbing the partial charge of the oxygen atoms
atom
name O
initial_type o
final_type o
initial_LJ 3.04812 0.14630
final_LJ 3.04812 0.14630
initial_charge -0.59001
final_charge -0.62510
endatom
atom
name O1
initial_type o
final_type o
initial_LJ 3.04812 0.14630
final_LJ 3.04812 0.14630
initial_charge -0.55001
final_charge -0.55110
endatom
But no such entries are found in 54-42. For the first entry above the partial change is significant (-0.59 to -0.62). This will influence the MBAR free energy changes and gradients and could explain the inconsistency.
In 42-54 there are 3 nitrogen atoms (name N, N1, N2) listed somd.prm7 but the names don't appear anywhere in the pert file. In 54-42 there are 4 nitrogen atoms (N, N1, N2, N3) and 3 are listed as perturbed in the pert file.
Can you confirm that the pert files haven't been mixed up with inputs for another perturbation ? If not, we need to inspect your entire python pipeline to understand why this has happened.
@lohedges alternatively we could by default calculate a mapping based on e.g. a decreasing/increasing number of heavy atoms, that would prevent having to do a double MCS (can get tedious for some scaffolds out there). Could potentially include a flag to disable this behaviour.
@jmichel80: I've locally repeated the mapping for 42-54 and don't see the same behaviour. The mappings in both directions are identical and the pert files contain the same number of perturbed atoms, with the initial/final parameters reversed, e.g.:
import BioSimSpace as BSS
# Load.
m0 = BSS.IO.readMolecules("inputs/ligands/ejm_42.pdb")[0]
m1 = BSS.IO.readMolecules("inputs/ligands/ejm_54.pdb")[0]
# Parameterise.
m0 = BSS.Parameters.gaff(m0).getMolecule()
m1 = BSS.Parameters.gaff(m1).getMolecule()
# Map.
mapping0 = BSS.Align.matchAtoms(m0, m1)
mapping1 = BSS.Align.matchAtoms(m1, m0)
# Check they are the same.
for k,v in mapping0.items():
assert k == mapping1[v]
# Merge.
merged0 = BSS.Align.merge(m0, m1, mapping0)
merged1 = BSS.Align.merge(m1, m0, mapping1)
# Create pert files.
BSS.Process._somd._to_pert_file(merged0, "morph0.pert")
BSS.Process._somd._to_pert_file(merged1, "morph1.pert")
(Using latest devel. Working in 04_fep
directory.)
Ah, I'm so sorry that I did not check it by myself earlier. . . 😰
I just compared the output .pert
file using my local modified version of BioSimSpace with the latest dev, It seems like I've broken something that's essential for merging two molecules.
The latest dev does give the same number of perturbed atoms and parameter reversed.
No problem. When developing it might be worth running the unit tests locally from time-to-time to make sure that things don't break, i.e. via:
pytest -s -vvv test
(I like to add -s -vvv
so that I get as much output as possible. Can be omitted as necessary.)
I assume some of the tests in test/Align
would fail if they are no longer returning the expected mappings. (Although it's possible they might not. If so, it might be good to add a unit test to try to catch this.)
Cheers.
@kexul let us know if you get more consistent results after fixing your build. It will be useful to discuss what can be done to improve the precision of certain mappings, I think there is scope to optimise the default mapping settings used by BSS.
No problem. When developing it might be worth running the unit tests locally from time-to-time to make sure that things don't break.
Thanks, I will surely follow that code of conduct.
@kexul let us know if you get more consistent results after fixing your build.
Sure, I'll post updates here. 😄
I've finished two runs with the latest dev build. The average hysteresis is 0.358, which is very close to the value reported in the paper. 🎉🎉🎉
Thanks for all of your helpful discussions and kind support, much appreciated!
Great ! Out of interest what is the structure of jul01 ? It doesn't seem to be present in our own Tyk2 networks. It's interesting to see several edges involving this compound show larger variability. If you inspect closely the mappings it may be possible to find a more efficient way to process these edges.
jul01~ejm45 | -1.0599 | -1.9752
jul01~ejm46 | -1.09912 | -1.81115
jul01~ejm47 | -0.91361 | -2.08847
Hi @jmichel80 , I'm using the same edges as the paper did. Here is the file in the paper's supplement files described the perturbations. Though the structure of jul01
is not explicitly shown in the main context, it's easy to infer its structure by the MCS of the molecules it connected.
Here is the full print of my perturbation network, the larger variability may be due to the growth/annihilation of a whole ring. 😃
Ah yes I had forgotten about this network. Your data may interest @JenkeScheen there’s lots of way to build a network.
vary the number of lambda windows used for each perturbation. Cresset has developed an adaptive sampling procedure to help select a good number of windows per perturbation that you may find interesting. Alternatively use heuristics to assign more windows to 'noisy' perturbations
Hi @jmichel80, I just find the linked method quite interesting. May I ask, why we could use a fast (50ps) pre-calculation
to determine the lambda window? Wouldn't it provide an error-prone estimation of the overlap matrix because of insufficient sampling?
Hi @kexul This is a heuristic that is not guaranteed to give optimal results, but generally gives better performance than not optimising at all the allocation of computing resources. Longer runs would be necessary to sample equilibrium distributions but that is not necessary in this implementation. Qualitatively the structure of the overlap matrix is often rapid to establish and looks similar between bound/free legs. This is sufficient to make an informed guess on what windows can be skipped.
Many thanks for your detailed answer! @jmichel80 🤗
Dear BioSimSpace developers: Thanks for the detailed tutorials provided here, it offers a good starting point for beginners like me to start simulation. Followed the tutorial, I've conducted several free energy perturbation tasks, but there are two common problems I've encountered frequently:
I used to believe that's because I did not precisely follow the tutorial (eg. I used Gromacs to equilibrate the system ). But, after analyzing the fep output in this repo, I found the same problems exist here.
For example, the average hysteresis in somd_run_1_i.csv is 1.23 kcal/mol, for reference, the average hysteresis of TYK2 in this paper is 0.35 kcal/mol (calculated from the fep_results_freenrgworkflow_cresset_valid.csv in the supplementary material).
For deviation between different runs, the bar plot shows that some of the perturbations exceed 1 kcal/mol:
Are there some good practices we could follow to reduce the hysteresis and deviation? Thanks! 🤗