[user story] COVID Moonshot and ASAP large-scale free energy calculations for synthesis prioritization

In broad terms, what are you trying to do?

The COVID Moonshot and its successor, ASAP, are pursuing open-science, patent-free antiviral drug discovery projects for the public good. Hit-to-lead and lead optimization both involve generating large virtual synthetic libraries, in which a common intermediate is combined with a large library of building blocks from CROs like Enamine, WuXi, and Sai to make many analogues. Predictions over these virtual synthetic libraries are used to prioritize compounds for synthesis in both phases.

As input, we would like to incorporate both manually submitted compound designs and enumerated virtual synthetic libraries (example), and to automate a workflow that selects appropriate X-ray structures with related reference ligands (many co-crystal structures are often available), prepares the systems for free energy calculations, builds a transformation network that connects reference ligands with designs (including redundancy), executes the relative free energy calculations on Folding@home, performs on-the-fly analysis, and provides up-to-date distilled results in a form that can be pulled into a dashboard (example from fah-xchem) that the chemistry design teams can use to action designs for synthesis. All generated data will be openly archived and usable by anyone for a variety of research purposes (methodology improvement, ML, compound design).
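The network-building step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `build_network` and `jaccard`, the character-set similarity (a stand-in for a real chemical similarity or atom-mapping score), and the example SMILES strings. Reference ligands are connected to each other, and each design is linked to its two most similar references, so every design has more than one path back to a ligand with a known binding pose.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Placeholder similarity: Jaccard overlap of the character sets of two strings."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def build_network(references, designs, similarity=jaccard, n_links=2):
    """Return a set of (ligand_a, ligand_b) edges.

    Reference ligands are connected to each other, and each design is
    connected to its `n_links` most similar references. With n_links >= 2,
    every design is reachable along more than one path (redundancy).
    """
    edges = set(combinations(sorted(references), 2))
    for design in designs:
        ranked = sorted(references, key=lambda ref: similarity(ref, design), reverse=True)
        for ref in ranked[:n_links]:
            edges.add((ref, design))
    return edges


# Illustrative inputs only; these are not project compounds.
references = ["c1ccccc1O", "c1ccccc1N", "c1ccncc1"]
designs = ["c1ccccc1OC", "c1ccccc1NC", "c1ccncc1Cl"]
network = build_network(references, designs)
```

A production planner would replace the placeholder similarity with a proper mapping score and would likely cap edge counts for very large libraries; the sketch only shows where redundancy enters the graph.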

Future improvements could include allocating effort adaptively among relative and absolute free energy calculations to enable more efficient execution, as well as using much more efficient single-replica methods like SAMS / Times Square Sampling in OpenMM and/or GROMACS.

How do you believe using this project would help you to do this?

The current process of setting up a COVID Moonshot sprint involves manually executing a sequence of scripts that prepare the calculations for Folding@home, a significant amount of babysitting while the Folding@home projects launch, and then a tailored automated script that combines analysis and dashboard generation. Many of these steps are generalizable components for setting up, executing, and analyzing automated free energy calculations.

Instead, by factoring all reusable components into conda-installable modules that use common data models and APIs, we can collaboratively build an ecosystem that can be assembled into reusable workflows that can automate, simplify, robustify, streamline, and optimize/improve this process to enable us to scale to support many discovery projects. This work should be synergistic with supporting other similar projects by providing compute capabilities for open discovery or research projects; additional use cases (such as prediction of resistance mutations and XChem Fragalysis) will be added soon.

What problems do you anticipate with using this project to achieve the above?

The most critical steps are to identify all the reusable components and modules; to clearly define data and object models that are extensible (e.g. transformations should support small molecule transformations for one or more protein targets, point mutations, and transformations in other phases such as lipids); to define clear base APIs that allow implementations to innovate without breaking those APIs; and to ensure the ecosystem is conda-installable.
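One way to picture an extensible transformation model is a small abstract base class with concrete subclasses registered by a `kind` tag. This is only a sketch under stated assumptions: the class names (`Transformation`, `LigandTransformation`, `PointMutation`), their fields, the `REGISTRY` lookup, and the `from_dict` helper are all hypothetical and do not describe an existing API in this project.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from typing import ClassVar


@dataclass(frozen=True)
class Transformation(ABC):
    """Hypothetical base model: an alchemical change from `initial` to `final`."""

    initial: str
    final: str
    kind: ClassVar[str] = "abstract"

    @abstractmethod
    def validate(self) -> None:
        """Raise ValueError if the transformation is not well formed."""

    def to_dict(self) -> dict:
        # Tag the payload with `kind` so a reader can pick the right class.
        return {"kind": self.kind, **asdict(self)}


@dataclass(frozen=True)
class LigandTransformation(Transformation):
    target: str
    kind: ClassVar[str] = "ligand"

    def validate(self) -> None:
        if self.initial == self.final:
            raise ValueError("initial and final ligands must differ")


@dataclass(frozen=True)
class PointMutation(Transformation):
    target: str
    residue_index: int
    kind: ClassVar[str] = "point_mutation"

    def validate(self) -> None:
        if self.residue_index < 1:
            raise ValueError("residue_index is 1-based")


# New transformation types are added by registering another subclass.
REGISTRY = {cls.kind: cls for cls in (LigandTransformation, PointMutation)}


def from_dict(payload: dict) -> Transformation:
    """Rebuild and validate a transformation from its dict serialization."""
    data = dict(payload)
    cls = REGISTRY[data.pop("kind")]
    obj = cls(**data)
    obj.validate()
    return obj


edge = LigandTransformation(initial="ref-ligand", final="design-1", target="example-target")
restored = from_dict(json.loads(json.dumps(edge.to_dict())))
```

The design choice illustrated is that callers depend only on the base class and the serialized `kind`, so a lipid-phase or other transformation could be added later without changing existing code.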

Storage requirements should pose less of a problem if we adopt a clear separation between high-value low-storage, mid-value medium-storage, and low-value large-storage categories:

- High-value low-storage: Summarized analysis statistics (free energy estimates, uncertainties, statistical inefficiencies, statistical fluctuation, other statistical quantities) are easily stored as dict-like serializations (e.g. JSON) or more compressed serializations (BSON, MessagePack, etc). This data is easily provided at low cost in easily accessible locations, such as S3, or served via RESTful APIs.
- Mid-value medium-storage: Extracted representative snapshots or binding poses / structural ensembles; work value distributions, NxK energies in all thermodynamic states, or dU/dlambda trajectories in portable numerical formats; representative trajectories. This data is reasonably easy to provide via S3, with URIs served via RESTful APIs.
- Low-value large-storage: Raw trajectories; raw simulation outputs; logs; etc. Some of this data may never need to be inspected but might be desirable to retain for reproducibility or data retention policy purposes. Requestor-pays and high-latency storage solutions (such as Glacier and GCE equivalents) are likely good candidates.
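As a rough illustration of the tiering above, the sketch below maps artifact kinds to tiers and serializes a high-value summary record as JSON. The artifact names, the `Tier` enum, the `tier_for` helper, and the field names in `summary` are assumptions made for this example, and the numbers are placeholders rather than real results.

```python
import json
from enum import Enum


class Tier(Enum):
    HIGH_VALUE = "high-value low-storage"
    MID_VALUE = "mid-value medium-storage"
    LOW_VALUE = "low-value large-storage"


# Hypothetical artifact kinds mapped to the tiers described above.
TIER_BY_KIND = {
    "free_energy_summary": Tier.HIGH_VALUE,
    "representative_snapshots": Tier.MID_VALUE,
    "work_distributions": Tier.MID_VALUE,
    "raw_trajectory": Tier.LOW_VALUE,
    "log": Tier.LOW_VALUE,
}


def tier_for(kind: str) -> Tier:
    """Look up the storage tier for an artifact kind (KeyError if unknown)."""
    return TIER_BY_KIND[kind]


# Placeholder numbers, not real results.
summary = {
    "transformation": ["reference-1", "design-7"],
    "ddg_kcal_per_mol": -1.2,
    "uncertainty_kcal_per_mol": 0.3,
    "statistical_inefficiency": 2.5,
}

# Dict-like summaries serialize to a few hundred bytes of JSON.
payload = json.dumps(summary, sort_keys=True)
```

Because the summary record is tiny, it could be served directly from an API response, while the mid- and low-value artifacts would be referenced by URI.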

Raw notes from story review, shared here for visibility:

- perhaps the most complex use case among the user stories; would generate graphs including both RBFE and ABFE transformations; operates in a tight feedback loop
- less interest in accurate binding free energy values than in accurate ranking of candidate compounds for synthesis
- graphs would intentionally include redundant paths connecting compound designs to reference ligands (for which we have crystal structures for binding poses)
- lays out vision for reusability of components to support many discovery projects, not just the ones we are directly involved with
  - the upfront challenges are harder, but the long-term benefits grow with the time invested
- raises tiering of data storage, probably each tier with its own retention policy; each tier and its data elements should have some configurables (such as snapshot frequency), with per-project limits; should also consider retention time limits for each tier set at org, campaign, project level
  - high-value, low-capacity : summarized analysis stats (free energy estimates, uncertainties, statistical inefficiencies, statistical fluctuation, other statistical quantities)
  - med-value, med-capacity : representative snapshots or binding poses / structural ensembles, work value distributions, NxK energies in all thermodynamic states, dU/dlambda trajectories in portable numerical formats; perhaps still small enough to be served over HTTP via RESTful API
  - low-value, high-capacity : raw trajectories, logs, etc.; probably stored in high-latency storage such as Glacier with requestor-pays enabled
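The note about ranking mattering more than absolute values can be made concrete with a rank correlation. The sketch below computes Kendall's tau (the simple tau-a variant, which does not correct for ties) between predicted and measured affinities. The function name and all numbers are illustrative assumptions, not project data, and the notes do not say which ranking metric the project would actually use.

```python
from itertools import combinations


def kendall_tau(predicted, observed):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        sign = (predicted[i] - predicted[j]) * (observed[i] - observed[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up binding free energies in kcal/mol for four compounds.
predicted = [-9.1, -8.0, -7.2, -6.5]
observed = [-8.5, -8.9, -7.0, -6.0]
tau = kendall_tau(predicted, observed)

# A constant offset changes every absolute value but not the ranking.
shifted = [p + 2.0 for p in predicted]
tau_shifted = kendall_tau(shifted, observed)
```

Here one of the six compound pairs is ordered incorrectly, and the shifted predictions score identically, which is the sense in which a systematic error in absolute values need not hurt synthesis prioritization.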

openforcefield / alchemiscale

[user story] COVID Moonshot and ASAP large-scale free energy calculations for synthesis prioritization #5