openforcefield / alchemiscale

a high-throughput alchemical free energy execution system for use with HPC, cloud, bare metal, and Folding@Home
https://docs.alchemiscale.org/
MIT License

Benchmark proteins (eventually other biomolecules) using F@H #9

Open mrshirts opened 2 years ago

mrshirts commented 2 years ago

In broad terms, what are you trying to do?

Benchmark OpenFF force fields for protein structure using F@H. We would want to launch a bunch of jobs with just plain ol' proteins in water to gather a large amount of aggregate time. We will probably eventually need to do enhanced sampling, probably using Markov state modeling or another method appropriate to OpenFF, but simply gathering a lot of aggregate simulation time is likely to help a lot in converging NMR observables.

We may need to run non-OpenFF force fields for comparison. amber14sb would probably be trivial; anything else could be hard (and maybe we don't bother?).

It does not matter what software is used; GROMACS and OpenMM should give the same results.

How do you believe using this project would help you to do this?

We need to gather large amounts of aggregate simulation time to test NMR observables and other observables to compare to experiment. It is unlikely we will get enough aggregate simulation time without F@H to do statistically meaningful tests.
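To make the "statistically meaningful" point concrete, here is a rough sketch of why aggregate time matters: if an observable fluctuates with intrinsic spread sigma and decorrelates over a time tau, the standard error of its time average falls roughly as sigma / sqrt(T / tau). The sigma and tau values below are illustrative assumptions, not numbers from this thread.

```python
import math

def standard_error(sigma: float, tau_ns: float, total_ns: float) -> float:
    """Approximate standard error of a time-averaged observable,
    treating total_ns / tau_ns as the number of independent samples."""
    n_independent = total_ns / tau_ns
    return sigma / math.sqrt(n_independent)

# Illustrative assumption: sigma = 1.0 (arbitrary units), tau = 100 ns.
# Going from 10 us to 2.3 ms of aggregate sampling shrinks the error
# by a factor of sqrt(230), i.e. roughly 15x.
se_small = standard_error(1.0, 100.0, 10_000.0)      # 10 us aggregate
se_large = standard_error(1.0, 100.0, 2_300_000.0)   # 2.3 ms aggregate
```

This is only a back-of-the-envelope estimate (it ignores systematic error and assumes well-decorrelated samples), but it is the basic reason a single workstation is unlikely to get us there.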

What problems do you anticipate with using this project to achieve the above?

chapincavender commented 2 years ago

I'll provide some more detail about the anticipated simulation needs. Based on previous protein force field benchmarks, we will estimate NMR observables (chemical shifts, scalar couplings, and NOEs) from unbiased MD simulations for three sets of protein systems:

This proposal would entail 6.9 ms of aggregate sampling from unbiased MD for all systems and force fields (2.3 ms per force field/water model). Going beyond the above will require enhanced sampling algorithms, and I agree with @mrshirts that we should rely on expertise from the Voelz and Bowman groups for this.
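For bookkeeping, the budget above works out as a simple sketch. The total and per-force-field numbers come from this comment; the per-clone trajectory length is an assumption purely for illustration, since F@H work unit sizing was not specified here.

```python
MS_TO_NS = 1_000_000  # 1 ms = 1e6 ns

total_ms = 6.9     # aggregate unbiased MD across all systems and force fields
per_ff_ms = 2.3    # per force field/water model combination

# 6.9 / 2.3 -> 3 force field/water model combinations
n_combinations = total_ms / per_ff_ms

# Assumed example: if each F@H clone contributes ~1 us (1000 ns) of
# trajectory, the whole proposal is on the order of thousands of clones.
clone_length_ns = 1_000  # assumption, not from the thread
n_clones = total_ms * MS_TO_NS / clone_length_ns
```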

j-wags commented 2 years ago

I wonder if the requirements here are kind of a "zero" case for another workflow - like #1 but with no edge/transformation, or #6 but with no ligand?

This may face the same criticism as #1 ("why do we need a giant supercomputer for this?"), and the answer is that (per Chapin's message above) the protein observable benchmarks add up to 6.9 ms of simulation per shot. We'll want to take at least one shot to just benchmark the Rosemary release, and it will be really helpful to be able to use the same infrastructure to expand comparisons to include additional FFs, more proteins, or to test improvements to Rosemary.

So the inputs would be:

And the desired output would be the raw trajectory.

dotsdl commented 2 years ago

Raw notes from story review, shared here for visibility:

chapincavender commented 2 years ago

My preference for this use case is to have full trajectories as output and do the analysis locally. You can use multiple forward models for each type of observable, and you can use the same set of trajectories for comparison with multiple experimental observables. If we decide later that we want to swap out the forward model or include an additional observable, it will be useful to have the raw trajectories on hand.
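As one example of why keeping raw trajectories is valuable, a scalar-coupling forward model is just a cheap post-processing step over backbone dihedrals: here is a minimal sketch of 3J(HN-HA) via the Karplus equation. The Karplus coefficients below are the Vuister & Bax 1993 parameterization; swapping the forward model later would mean changing only these coefficients, not rerunning any simulation.

```python
import math

# Karplus coefficients for 3J(HN-HA), in Hz (Vuister & Bax 1993).
A, B, C = 6.51, -1.76, 1.60

def j_coupling_from_phi(phi_deg: float) -> float:
    """3J(HN-HA) in Hz for a backbone dihedral phi (degrees).
    theta = phi - 60 deg relates phi to the H-N-CA-HA dihedral."""
    theta = math.radians(phi_deg - 60.0)
    return A * math.cos(theta) ** 2 + B * math.cos(theta) + C

def mean_j(phis_deg) -> float:
    """Average the forward model over trajectory frames.
    Note: average J over frames, not J of the average phi."""
    return sum(j_coupling_from_phi(p) for p in phis_deg) / len(phis_deg)
```

The same trajectory frames could feed chemical-shift predictors or NOE distance averaging in exactly the same way, which is the argument for shipping trajectories rather than pre-computed observables.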