openforcefield / alchemiscale

a high-throughput alchemical free energy execution system for use with HPC, cloud, bare metal, and Folding@Home
https://docs.alchemiscale.org/
MIT License

[user story] absolute binding free energies on a large scale #4

Open jeff231li opened 2 years ago

jeff231li commented 2 years ago

In broad terms, what are you trying to do?

Run absolute binding free energy (ABFE) calculations for host-guest or protein-ligand systems at a scale of 100-1000 systems. This is for benchmarking new force field parameters/functional forms. The other point, discussed in the first meeting as probably outside the scope of the project, is to possibly integrate ForceBalance with F@H to enable optimization.

How do you believe using this project would help you to do this?

Using the resources available on F@H, ABFE calculations in the 1000s should be possible. Right now we can probably do ~100 systems in a reasonable time, but this at times requires manual handling by the user. I believe this project can help perform these calculations automatically with as little intervention from the user as possible.

What problems do you anticipate with using this project to achieve the above?

The windows in the attach-pull-release (APR) method are independent of each other, so there is no information sharing until the analysis stage. However, one problem I see is that the simulations for one or more windows could crash for some reason. Having retry logic in place would prevent the user from having to rerun the whole calculation from scratch. Related to this, if a workflow is interrupted, will the user need to rerun it from scratch? The other problem I see is the amount of data generated by the simulations, which can be in the 100s of TB.
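Because APR windows are independent, the retry logic asked for above can operate per window rather than per calculation. A minimal sketch (all names hypothetical, not alchemiscale code; `run_window` stands in for a single window simulation and is rigged to fail once so the retry path is exercised):

```python
def run_window(index: int, attempt: int) -> float:
    """Stand-in for a single APR window simulation (hypothetical).
    Fails on the first attempt for window 3 to exercise the retry path."""
    if index == 3 and attempt == 0:
        raise RuntimeError("simulation crashed")
    return float(index)  # pretend per-window result

def run_all_windows(n_windows: int, max_retries: int = 3) -> list[float]:
    """Run windows independently; retry only the windows that fail,
    never the whole calculation."""
    results = []
    for i in range(n_windows):
        for attempt in range(max_retries):
            try:
                results.append(run_window(i, attempt))
                break  # this window succeeded; move on
            except RuntimeError:
                continue  # resubmit just this window
        else:
            raise RuntimeError(f"window {i} failed after {max_retries} attempts")
    return results
```

The key design point is that a crash costs one window's worth of compute, not the whole workflow.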

jchodera commented 2 years ago

scale ranging from 100-1000 systems

@jeff231li: By "systems", do you mean "complexes, where the receptor is shared for all systems", or do you mean completely distinct systems, where the receptor may vary widely from system to system?

The other point was discussed in the first meeting as probably outside of the scope of the project -- possibly integrate ForceBalance to F@H to enable optimization.

While full integration of ForceBalance may be difficult, PropertyCalculator uses a workflow-based paradigm where it is very plausible for a stage of the calculations either to be aggregated to run in batches on Folding@home, or for an initial batch of reference calculations to be run on Folding@home via fah-alchemy and fed into PropertyCalculator. For example, if your goal is to optimize a force field to match protein-ligand absolute or relative free energy differences, an initial expensive fah-alchemy calculation for the initial force field parameters could be performed, and subsequent PropertyCalculator calculations could then rapidly compute the gradient of the free energy (and, to some extent, the perturbed free energies) simply by running endpoint calculations of the bound complex and unbound components---no further alchemical calculations are needed in principle.
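One way to read the endpoint-gradient idea above (a sketch added here, not from the thread): applying the standard thermodynamic identity $\partial A/\partial\theta = \langle \partial U/\partial\theta \rangle_\theta$ to each endpoint state, and writing the binding free energy as $\Delta G_{\mathrm{bind}} = G_{\mathrm{complex}} - G_{\mathrm{receptor}} - G_{\mathrm{ligand}}$ (each solvated), the parameter gradient needs only equilibrium averages from the endpoint simulations:

```latex
% Gradient of the binding free energy with respect to a force field
% parameter theta, from endpoint ensemble averages only:
\frac{\partial \Delta G_{\mathrm{bind}}}{\partial \theta}
  = \left\langle \frac{\partial U}{\partial \theta} \right\rangle_{\mathrm{complex}}
  - \left\langle \frac{\partial U}{\partial \theta} \right\rangle_{\mathrm{receptor}}
  - \left\langle \frac{\partial U}{\partial \theta} \right\rangle_{\mathrm{ligand}}
```

No alchemical intermediate states appear in this expression, which is why only endpoint calculations are needed for the gradient.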

The windows in the attach-pull-release (APR) method are independent of each other, so there is no information sharing until the analysis stage. However, one problem I see is that the simulations for one or more windows could crash for some reason. Having retry logic in place would prevent the user from having to rerun the whole calculation from scratch. Related to this, if a workflow is interrupted, will the user need to rerun it from scratch? The other problem I see is the amount of data generated by the simulations, which can be in the 100s of TB.

APR was not mentioned prior to this paragraph---do you envision wanting to run APR free energy calculations on Folding@home, or alchemical free energy calculations?

Calculations that run multiple replicates on Folding@home for each of many variants of a global parameter (such as an umbrella center) are very easy to set up and run right now. There is also automatic recovery built in in case a single simulation encounters a NaN---it will try several times to recover from the last good checkpoint. Together with multiple replicates, this usually yields robust results.
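The checkpoint-recovery behavior described above can be sketched roughly as follows (hypothetical names; this is an illustration of the pattern, not Folding@home core code; `advance` stands in for one simulation chunk and is rigged to raise a single recoverable NaN):

```python
def advance(state: float, nan_steps: set[int], step: int) -> float:
    """Stand-in for one simulation chunk; raises a 'NaN' at chosen steps.
    The failure is recoverable: it succeeds when retried."""
    if step in nan_steps:
        nan_steps.discard(step)
        raise FloatingPointError("NaN encountered")
    return state + 1.0

def run_with_checkpoints(n_steps: int, max_recoveries: int = 5) -> float:
    """Advance in chunks, checkpointing after each good chunk; on a NaN,
    roll back to the last good checkpoint and retry, up to a limit."""
    state = 0.0
    checkpoint = state
    nan_steps = {2}  # simulate one NaN partway through the run
    recoveries = 0
    step = 0
    while step < n_steps:
        try:
            state = advance(state, nan_steps, step)
            checkpoint = state  # save a good checkpoint after each chunk
            step += 1
        except FloatingPointError:
            recoveries += 1
            if recoveries > max_recoveries:
                raise  # give up only after repeated failures
            state = checkpoint  # restore last good checkpoint and retry
    return state
```

Combined with multiple independent replicates, a transient NaN costs at most one chunk of lost work per replicate.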

The big question for bringing in APR would be whether its data flow model resembles the other use cases closely enough that there would be an advantage in the automation provided by fah-alchemy. If you have a large benchmark you can easily set up with APR now, we could easily set up a calculation on FAH with some custom scripts. The key for fah-alchemy will be identifying common APIs and data models that would enable it to be integrated seamlessly as an alternative.

jeff231li commented 2 years ago

@jeff231li: By "systems", do you mean "complexes, where the receptor is shared for all systems", or do you mean completely distinct systems, where the receptor may vary widely from system to system?

By "systems" I mean H-G or P-L complexes where the receptor may vary widely from system to system (though each receptor may have multiple ligands). I'm looking at calculating binding free energies for a large set of mixed H-G and P-L complexes.

For example, if your goal is to optimize a force field to match protein-ligand absolute or relative free energy differences, an initial expensive fah-alchemy calculation for the initial force field parameters could be performed, and subsequent PropertyCalculator calculations could then rapidly compute the gradient of the free energy (and, to some extent, the perturbed free energies) simply by running endpoint calculations of the bound complex and unbound components---no further alchemical calculations are needed in principle.

Do you mean optimizing FF parameters by way of reweighting?

APR was not mentioned prior to this paragraph---do you envision wanting to run APR free energy calculations on Folding@home, or alchemical free energy calculations?

Yes, I want to run APR calculations on Folding@home, if possible. If not, then alchemical calculations will also work for absolute binding free energies.

The big question for bringing in APR would be whether its data flow model resembles the other use cases closely enough that there would be an advantage in the automation provided by fah-alchemy. If you have a large benchmark you can easily set up with APR now, we could easily set up a calculation on FAH with some custom scripts. The key for fah-alchemy will be identifying common APIs and data models that would enable it to be integrated seamlessly as an alternative.

Right now for H-G calculations, Evaluator reads in metadata from Taproom to build the workflow for the APR calculations. I have about 100 H-G complexes prepared and plan to add P-L complexes as well. Does this look like something that can work with fah-alchemy?
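Whether this fits fah-alchemy largely comes down to whether the Taproom-driven workflow can be expanded into independent, schedulable tasks. A sketch of that expansion (all names here are hypothetical and illustrative, not the Taproom or Evaluator API; the window counts are made up):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WindowTask:
    """One independently runnable unit of an APR calculation (hypothetical)."""
    system_id: str   # e.g. a host-guest pair identifier
    phase: str       # "attach", "pull", or "release"
    window: int      # window index within the phase

def build_windows(system_id: str, n_attach: int, n_pull: int,
                  n_release: int) -> list[WindowTask]:
    """Expand one system's metadata into independent per-window tasks,
    the granularity a scheduler could distribute across workers."""
    tasks = []
    for phase, n in (("attach", n_attach), ("pull", n_pull),
                     ("release", n_release)):
        tasks.extend(WindowTask(system_id, phase, w) for w in range(n))
    return tasks

# One hypothetical host-guest system expanded into its window tasks:
tasks = build_windows("host-guest-001", n_attach=15, n_pull=46, n_release=15)
```

If each `WindowTask` maps cleanly onto a scheduler's task unit, the ~100 prepared complexes become a flat list of independent tasks rather than ~100 monolithic workflows.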

dotsdl commented 2 years ago

Raw notes from story review, shared here for visibility: