jchodera opened 5 years ago
I've nearly finished a prototype of this. My current thinking is that we should create a separate repo that would contain these tools---at least for now.
What do you think about qcfractal-submit, qcfractaltools, or openforcefield-qcfractal? One possibility would be to use namespace-style packages and have this install openforcefield.qcfractaltools or some other subpackage within the openforcefield namespace.
@davidlmobley : Here's an illustration of using the API in preparing the PDB Ligand Expo dataset from Genentech:
for prefix in ['Neutral', 'Charged']:
    # Load input molecules
    oemols = cmiles.chemi.file_to_oemols(f'pubLigs{prefix}GoodDensity.sdf')

    # Create an OptimizationDataset
    factory = OptimizationDatasetFactory()
    factory.input_filters = ['BlockBuster']
    factory.enumerate_stereochemistry = False
    factory.enumerate_tautomers = False
    factory.fragment = False
    factory.max_conformers = 20
    factory.submit_filters = ['']
    factory.compute_hessians = False
    dataset = factory.create_dataset(f'PDB Ligand Expo {prefix} Genentech-filtered optimization dataset 1', oemols)
    dataset.write_json(f'pubLigs{prefix}GoodDensity-OptimizationDataset.json.gz')
    dataset.write_pdf(f'pubLigs{prefix}GoodDensity-OptimizationDataset.pdf')
    dataset.write_smiles(f'pubLigs{prefix}GoodDensity-OptimizationDataset.smi')

    # Create a TorsionDriveDataset
    factory = TorsionDriveDatasetFactory()
    factory.input_filters = ['BlockBuster']
    factory.enumerate_stereochemistry = False
    factory.enumerate_tautomers = True
    factory.fragment = True
    factory.submit_filters = ['Fragment']
    factory.max_conformers = 2
    factory.grid_spacing = 15
    dataset = factory.create_dataset(f'PDB Ligand Expo {prefix} Genentech-filtered torsion dataset 1', oemols)
    dataset.write_json(f'pubLigs{prefix}GoodDensity-TorsionDriveDataset.json.gz')
    dataset.write_pdf(f'pubLigs{prefix}GoodDensity-TorsionDriveDataset.pdf', highlight_torsions=True)
    dataset.write_smiles(f'pubLigs{prefix}GoodDensity-TorsionDriveDataset.smi', tag_torsion_atoms=False)
On the whole this looks excellent, @jchodera . This would make it so I'd actually construct datasets myself when on a time crunch, rather than just putting the source material there and wondering what should happen next. :)
Should there be an option to toggle obtaining/storing Wiberg bond orders?
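For what it's worth, one way such a toggle could look is an extra flag on the factory, alongside the existing `compute_hessians` option. This is purely an illustrative stub: the `compute_wiberg_bond_orders` attribute is a hypothetical name, not an existing API.

```python
# Illustrative sketch only: this factory stub and the Wiberg toggle are
# hypothetical names, not an existing API.
class OptimizationDatasetFactory:
    def __init__(self):
        self.compute_hessians = False
        # When True, also request Wiberg bond orders for each optimized
        # geometry and store them with the dataset entry (hypothetical flag)
        self.compute_wiberg_bond_orders = False

factory = OptimizationDatasetFactory()
factory.compute_wiberg_bond_orders = True  # opt in per dataset
```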
In terms of naming, I like all of the options you proposed:
What do you think about qcfractal-submit, qcfractaltools, or openforcefield-qcfractal? One possibility would be to use namespace-style packages and have this install openforcefield.qcfractaltools or some other subpackage within the openforcefield namespace.
That said, I think we DO want to avoid the tendency to create a new repo for every new thing, because every repo adds some maintenance burden. On the other hand, this doesn't seem to be something that already has a home, so a separate repo may be a good idea.
@jchodera @davidlmobley I totally agree on the general plan, which contains two steps:
Modularize and standardize the functionality into a library. Provide a universal interface.
Seek approval from @dgasmith for authorization to submit and run calculations on the public QCArchive server (under the "openff" tag).
For the past few days, I've been carefully thinking about the design of this openforcefield-qcfractal library. I tried to understand its use cases, and identified three main difficulties that have prevented us from implementing this in the first place:
For various dataset submissions, customized features are often needed to deal with edge cases and unexpected science from the input data, and it's very hard to predict all such needs. We don't want to force users to edit the source code and merge a PR before each dataset can be submitted.
There is potentially a huge waste of resources if a submission is made with bugs or misuse of the library.
Certain submissions depend on the completion of others. For example, we only want to submit a hessian dataset after the optimizations are done. (This might be a desirable feature for QCPortal, but for now we have to handle it ourselves.)
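The third difficulty above amounts to a wait-until-complete step between two submissions. A minimal sketch of that pattern, using a stand-in mock client (the real code would query QCPortal for record statuses; `fraction_complete` and `wait_for_dataset` are hypothetical names):

```python
# Sketch of the "hessians only after optimizations finish" dependency.
# MockClient is a stand-in; a real client would query QCPortal for the
# dataset's record statuses instead of returning canned values.
import time

class MockClient:
    """Stand-in client that returns a canned sequence of completion fractions."""
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def fraction_complete(self, dataset_name):
        return next(self._statuses)

def wait_for_dataset(client, dataset_name, poll_interval=0.0):
    """Block until every calculation in the dataset is complete."""
    while True:
        if client.fraction_complete(dataset_name) >= 1.0:
            return
        time.sleep(poll_interval)

client = MockClient([0.4, 0.9, 1.0])
wait_for_dataset(client, "OpenFF test set 1")
# ...only at this point would the hessian submission script be invoked
```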
To address the above difficulties, I think it's necessary to expose a high-level structure of the submission procedure to the user. One way is to provide a heavily commented, clearly structured script to the user, which has several benefits:
New users reading this script will get a general idea of the submission procedure. This lowers the learning curve and reduces the chance of mistakes.
The script contains standardized features with default configurations, while also supporting the addition of specialized features. If a feature is widely useful, it will still be easy to contribute it to the library in a PR, with no need to wait for a merge.
As the submission procedure involves interfacing with multiple packages (fragmenter, cmiles, geomeTRIC, torsiondrive, qcportal, etc.), the detailed configuration of each interface can be fine-tuned in the script. I think this is actually more friendly than a long list of optional arguments.
The script can be executed at a chosen time. For example, the user can run the hessian submission script after the optimization dataset has finished computing.
The script can be used to launch "test submissions" on local test servers, greatly reducing mistakes and bugs in the final submission. If needed, it is even possible to automatically spin up a snowflake QCFractal server and test the submission before sending it to the public server.
When tests fail or produce unexpected data, the script provides a clear entry point for debugging and fixing the issues.
The script ensures maximum reproducibility of each submission, while also serving as documentation of how the submission was done. Such documentation is especially important when customized features are used.
The script is also highly reusable for similar submissions that use the same customized features.
Here is my proposed design, from a user's perspective:
Step 1: Install the library:
conda install -c conda-forge openforcefield-qcfractal
Step 2: Use the provided tool to generate a submission script:
openff-qcf-submit.py --type optimization
This command will print a brief message about which features are selected by default, such as
Generating submit_optimization.py script
1. <fragmenter> is selected as default conformer generation tool
Max 10 conformers per molecule
2. Default filters selected: [ "like-drug", "less than 50 heavy atoms" ].
3. <geomeTRIC> is selected as the default optimizer.
4. B3LYP-D3(bj) / DZVP is selected as the default QM method.
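Internally, the generator could consult a defaults table keyed by dataset type; the values below just mirror the message above and the factory settings earlier in the thread, and the table itself is purely illustrative:

```python
# Hypothetical defaults table the script generator could consult; the
# keys and values are illustrative only, mirroring the defaults printed
# above and the factory settings from earlier in this thread.
DEFAULTS = {
    "optimization": {
        "conformer_tool": "fragmenter",
        "max_conformers": 10,
        "filters": ["like-drug", "less than 50 heavy atoms"],
        "optimizer": "geomeTRIC",
        "qm_method": "B3LYP-D3(bj)/DZVP",
    },
    "torsiondrive": {
        "conformer_tool": "fragmenter",
        "max_conformers": 2,
        "grid_spacing": 15,
        "filters": ["like-drug", "less than 50 heavy atoms"],
        "optimizer": "geomeTRIC",
        "qm_method": "B3LYP-D3(bj)/DZVP",
    },
}

def describe_defaults(dataset_type):
    """Render the brief message printed when a script is generated."""
    cfg = DEFAULTS[dataset_type]
    lines = [f"Generating submit_{dataset_type}.py script"]
    lines += [f"  {key} = {value!r}" for key, value in cfg.items()]
    return "\n".join(lines)
```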
Step 3: Open and edit the generated "submit_optimization.py" script file.
Step 4: Run the script to do a test submission.
python submit_optimization.py input_smiles.smi
Step 5: If everything is correct, run the final submission to the production server:
python submit_optimization.py input_smiles.smi -c qcf_public.yaml
An example "submit_optimization.py" script I wrote will be pasted below.
For experienced users who want maximum convenience, steps 2-5 can be combined into a single command. New users are strongly encouraged to follow these steps for each new type of submission. If reproducibility is needed, the user should upload the submission script together with the input file to a repository.
Here is an example of the generated submit_optimization.py script.
#!/usr/bin/env python
"""
OpenForceField QCFractal Submission Script
Type: OptimizationDataset
Name: OpenFF test set 1
Date: 09-09-2019
"""
# Note: the package is imported with an underscore, since hyphens are not
# valid in Python module names
import openforcefield_qcfractal as offqcf


def step1_expand_conformers(input_smiles):
    """
    Expand conformers for the input list of molecules

    Parameters
    ----------
    input_smiles: List[str]
        A list of SMILES strings, each corresponding to one input molecule

    Returns
    -------
    molecules_data: Dict[str, Dict[str, any]]
        A dictionary mapping the label of each molecule to its data, e.g.
        {
            molecule_label1: {
                'conformers': [Conformer1, Conformer2, ..],
                'attributes': {
                    'canonical_isomeric_smiles': ...,
                }
            },
            molecule_label2: ...,
        }

    Notes
    -----
    1. The "label" will be used as the index for the entry in the resulting
       QCArchive dataset
    2. The "attributes" will be copied into each entry in the dataset; the
       content is optional
    """
    molecules_data = {}
    for smiles in input_smiles:
        # Expand protonation states and stereoisomers, by default using fragmenter
        molecule_states = offqcf.expand_states(smiles)
        # Each state is treated as a new molecule
        for molecule in molecule_states:
            molecule_attributes = offqcf.get_molecule_attributes(molecule)
            # the "canonical_isomeric_smiles" is used as the label here
            molecule_label = molecule_attributes['canonical_isomeric_smiles']
            molecule_conformers = offqcf.generate_conformers(molecule, max=10)
            molecules_data[molecule_label] = {
                'conformers': molecule_conformers,
                'attributes': molecule_attributes,
            }
    return molecules_data


def step2_filter_molecules(molecules_data):
    """
    Filter out unwanted molecules

    Parameters
    ----------
    molecules_data: Dict[str, Dict[str, any]]
        Output of step 1

    Returns
    -------
    filtered_molecules_data: Dict[str, Dict[str, any]]
        Same format as the input, with unwanted molecules removed
    """
    # keep only molecules with <= 50 heavy atoms
    filtered_molecules_data = offqcf.filter_50_heavy(molecules_data)
    # keep only drug-like molecules
    filtered_molecules_data = offqcf.filter_drug_like(filtered_molecules_data)
    # customized filters can be implemented here
    return filtered_molecules_data


def step3_submit(molecules_data, client_config=None):
    """
    Submit the OptimizationDataset

    Parameters
    ----------
    molecules_data: Dict[str, Dict[str, any]]
        Output of step 2, in the same format as the output of step 1
    client_config: str or None
        Path to a client config file; None runs a test submission
    """
    dataset_name = 'OpenFF test set 1'
    # geomeTRIC settings
    geometric_spec = {
        "coordsys": "dlc",  # prevent translation & rotation moves
        "qccnv": True,      # less tight convergence criteria
    }
    # QM specs
    qm_spec = {
        "method": "B3LYP-d3bj",
        "basis": "dzvp",
    }
    # create a client (client_config=None will do a test submission; replace
    # with a real config to submit to the public server)
    client = offqcf.create_client(client_config=client_config)
    # run the submission
    offqcf.create_optimization_dataset(client, dataset_name, molecules_data,
                                       geometric_spec, qm_spec, start_compute=True)


def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('input_smi', help='Input .smi file containing a list of SMILES strings')
    parser.add_argument('-c', '--client_config', help='public server client config file')
    args = parser.parse_args()
    with open(args.input_smi) as f:
        input_smiles = [line.strip() for line in f if line.strip()]
    molecules_data = step1_expand_conformers(input_smiles)
    filtered_molecules_data = step2_filter_molecules(molecules_data)
    step3_submit(filtered_molecules_data, client_config=args.client_config)


if __name__ == '__main__':
    main()
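As a side note on the step 2 design: the filter chain could also be expressed as a list of named predicates applied in sequence, which makes it easy to record what each filter removed. A self-contained sketch of the pattern (the filter names and the toy `heavy_atoms` field are illustrative, not an existing API):

```python
# Illustrative pattern for step 2: filters as (name, predicate) pairs
# applied in sequence, recording what each one removed. Real filters
# would inspect molecule objects rather than toy dictionaries.
def apply_filters(molecules_data, filters):
    kept = dict(molecules_data)
    removed = {}
    for name, keep in filters:
        dropped = [label for label, data in kept.items() if not keep(data)]
        for label in dropped:
            removed.setdefault(name, []).append(kept.pop(label))
    return kept, removed

# Toy stand-in for the Dict[str, Dict] produced by step 1
data = {
    "mol_a": {"heavy_atoms": 12},
    "mol_b": {"heavy_atoms": 75},
}
filters = [("<=50 heavy atoms", lambda d: d["heavy_atoms"] <= 50)]
kept, removed = apply_filters(data, filters)
# "mol_b" is removed by the heavy-atom filter; "mol_a" survives
```

Recording the removals per filter also gives the user a ready-made report for debugging why a molecule never made it into the dataset.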
The other nice thing about this approach is that it would (eventually, if our sponsors want to put developer time into it) allow us to make this process relatively agnostic to the cheminformatics toolkit, whereas right now our prep steps rely heavily on one specific toolkit.
One item that may also help, which is (slowly) coming, is a time estimator, so you can see whether you have requested 10M core-hours vs 10k core-hours. Quantum chemistry is quite opaque in this respect.
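Even without a real estimator, a back-of-envelope multiplication catches order-of-magnitude mistakes before submission. A sketch (every number below is illustrative; actual QM cost varies strongly with method, basis, and molecule size):

```python
# Back-of-envelope core-hour estimate; all numbers are illustrative, not
# measured costs. Real QM timings depend on method, basis, and system size.
def estimate_core_hours(n_molecules, conformers_per_molecule,
                        grid_points_per_torsion, opts_per_grid_point,
                        core_hours_per_optimization):
    """Rough cost of a torsiondrive dataset: each grid point runs
    constrained optimizations for each starting conformer."""
    optimizations = (n_molecules * conformers_per_molecule
                     * grid_points_per_torsion * opts_per_grid_point)
    return optimizations * core_hours_per_optimization

# e.g. 500 fragments, 2 conformers each, 24 grid points (15 degree
# spacing over 360 degrees), ~1 optimization per point, ~2 core-hours
# per optimization -> 48,000 core-hours
total = estimate_core_hours(500, 2, 24, 1, 2.0)
```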
I'm opening this issue to capture thoughts on a small library and CLI tool to aid QCArchive dataset submission for Open Force Field projects.
I'm thinking the library can do something like this:
However, there are so many arguments to prepare_dataset that it probably makes sense to instead design this as factories (one per dataset type) that can be configured via methods or fields. We could also add a CLI; we want it to default to doing reasonable things, but also allow overriding different options:
or also take the arguments from a JSON or YAML file for convenience
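One simple way to support JSON configuration is to apply a parsed document onto a factory's fields, rejecting unknown keys. A sketch under stated assumptions: the factory stub and its field names mirror the examples in this thread but are not an existing API.

```python
import json

# Illustrative only: a stub factory plus a loader that applies settings
# from a JSON document. Field names mirror the factory examples in this
# thread, but this is not an existing API.
class TorsionDriveDatasetFactory:
    def __init__(self):
        self.fragment = True
        self.max_conformers = 2
        self.grid_spacing = 15

def configure_from_json(factory, text):
    """Set factory fields from a JSON object, rejecting unknown keys."""
    for key, value in json.loads(text).items():
        if not hasattr(factory, key):
            raise KeyError(f"unknown factory option: {key}")
        setattr(factory, key, value)
    return factory

factory = configure_from_json(TorsionDriveDatasetFactory(),
                              '{"grid_spacing": 30, "max_conformers": 4}')
```

Rejecting unknown keys matters here: a typo in a config file should fail loudly rather than silently submit a dataset with default settings.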
We could also think of adding the capability to actually submit this to QCFractal rather than having @dgasmith submit them on our behalf.