openforcefield / qca-dataset-submission

Data generation and submission scripts for the QCArchive ecosystem.

Automating QCArchive dataset submission #53

Open jchodera opened 5 years ago

jchodera commented 5 years ago

I'm opening this issue to capture thoughts on a small library and CLI tool to aid QCArchive dataset submission for Open Force Field projects.

I'm thinking the library can do something like this:

from openforcefield import qcfractaltools
# Prepare dataset for submission to QCFractal
json_filename = 'dataset.json.gz' # the QCFractal input dataset to generate
molecules = [ Molecule.from_smiles(smiles) for smiles in molecule_smiles_list ] # list of Molecule objects to be prepared
name = 'My QCArchive dataset'
submitted_molecules = qcfractaltools.prepare_dataset( 
    dataset=json_filename,  # the JSON dataset name (`.json` or `.json.gz`)   
    molecules=molecules, # list of `Molecule` objects
    name=name, # the name assigned to the dataset
    fragment=True, # if True, will fragment and de-duplicate fragments within this dataset
    expand_protonation_states=True, # if True, will expand reasonable protonation states
    expand_tautomers=True, # if True, will expand reasonable tautomers for all protonation states
    input_filters=['Blockbuster'], # list of input filters to apply to initial molecule dataset
    submit_filters=['DFT'], # list of filters to apply to generated molecules to prevent problematic molecules from being submitted
    torsion_drives=['1D'], # if not None, list of which torsion drives to perform: '1D' and/or '2D'
    max_optimization_conformers=100, # if not None, will also generate optimization dataset with specified maximum number of conformers enumerated
    optimization_hessians=True, # if True, will also compute Hessians for minimized conformers (but not torsions)
    )
# inspect or visualize Molecule objects that were submitted
for molecule in submitted_molecules:
    ...

However, there are so many arguments to prepare_dataset that it probably makes sense to instead design this as factories (one per dataset type) that can be configured via methods or fields:

# Create a factory with default settings
from openforcefield.qcfractaltools import OptimizationDatasetGenerator
generator = OptimizationDatasetGenerator()
# Change some inherited molecule preparation settings
generator.fragment = True # fragment molecules
generator.enumerate_tautomers = True # enumerate reasonable tautomers
generator.input_filters = ['Blockbuster'] # apply some filters to the input molecules
generator.submission_filters = ['DFT'] # apply some filters to the submitted molecules to make sure we don't submit anything crazy to QM
# Change some OptimizationDataset-specific options
generator.max_conformers = 20 # maximum number of conformers per molecule
generator.compute_hessians = True # compute hessians after optimization complete
# Now generate a dataset
name = 'My QCFractal optimization dataset' # dataset name
molecules = [ Molecule.from_smiles(smiles) for smiles in molecule_smiles_list ] # list of Molecule objects to be prepared
qcfractal_dataset = generator.create_dataset(name, molecules)
# Write the dataset
qcfractal_dataset.write('qcfractal-dataset.json.gz')
# Get the molecules that were generated
molecules = qcfractal_dataset.molecules
# Write a PDF for the submitted molecules
qcfractal_dataset.write_pdf('qcfractal-dataset.pdf')

We can also add a CLI; it should default to doing reasonable things while allowing individual options to be overridden:

qcfractal-submit --molecules molecules.smi --name "My OpenFF dataset" --output qcfractal-dataset.json.gz --generator OptimizationDataset --fragment --expand-tautomers --compute-hessians --max-conformers 20

or, for convenience, take its arguments from a JSON or YAML file:

qcfractal-submit --json dataset.json
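The configuration file format is not specified here, but a plausible sketch of what `dataset.json` could contain, mirroring the CLI flags above (all keys are hypothetical), might be:

```json
{
    "molecules": "molecules.smi",
    "name": "My OpenFF dataset",
    "output": "qcfractal-dataset.json.gz",
    "generator": "OptimizationDataset",
    "fragment": true,
    "expand_tautomers": true,
    "compute_hessians": true,
    "max_conformers": 20
}
```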

We could also think of adding the capability to actually submit this to QCFractal rather than having @dgasmith submit them on our behalf.

jchodera commented 5 years ago

I've nearly finished a prototype of this. My current thinking is that we should create a separate repo that would contain these tools, at least for now.

What do you think about qcfractal-submit, qcfractaltools, or openforcefield-qcfractal?

One possibility would be to use namespace-style packages and have this install openforcefield.qcfractaltools or some other subpackage within the openforcefield namespace.
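For reference, a pkgutil-style namespace layout (one common approach; the names here just mirror the suggestion above) would look roughly like:

```
openforcefield-qcfractal/        # the new repo
└── openforcefield/              # shared namespace package
    ├── __init__.py              # contains only:
    │                            #   __path__ = __import__('pkgutil').extend_path(__path__, __name__)
    └── qcfractaltools/
        ├── __init__.py
        └── ...
```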

jchodera commented 5 years ago

@davidlmobley : Here's an illustration of using the API in preparing the PDB Ligand Expo dataset from Genentech:

    for prefix in ['Neutral', 'Charged']:
        # Load input molecules
        oemols = cmiles.chemi.file_to_oemols(f'pubLigs{prefix}GoodDensity.sdf')

        # Create an OptimizationDataset
        factory = OptimizationDatasetFactory()
        factory.input_filters = ['BlockBuster']
        factory.enumerate_stereochemistry = False
        factory.enumerate_tautomers = False
        factory.fragment = False
        factory.max_conformers = 20
        factory.submit_filters = ['']
        factory.compute_hessians = False
        dataset = factory.create_dataset(f'PDB Ligand Expo {prefix} Genentech-filtered optimization dataset 1', oemols)
        dataset.write_json(f'pubLigs{prefix}GoodDensity-OptimizationDataset.json.gz')
        dataset.write_pdf(f'pubLigs{prefix}GoodDensity-OptimizationDataset.pdf')
        dataset.write_smiles(f'pubLigs{prefix}GoodDensity-OptimizationDataset.smi')

        # Create a TorsionDriveDataset
        factory = TorsionDriveDatasetFactory()
        factory.input_filters = ['BlockBuster']
        factory.enumerate_stereochemistry = False
        factory.enumerate_tautomers = True
        factory.fragment = True
        factory.submit_filters = ['Fragment']
        factory.max_conformers = 2
        factory.grid_spacing = 15
        dataset = factory.create_dataset(f'PDB Ligand Expo {prefix} Genentech-filtered torsion dataset 1', oemols)
        dataset.write_json(f'pubLigs{prefix}GoodDensity-TorsionDriveDataset.json.gz')
        dataset.write_pdf(f'pubLigs{prefix}GoodDensity-TorsionDriveDataset.pdf', highlight_torsions=True)
        dataset.write_smiles(f'pubLigs{prefix}GoodDensity-TorsionDriveDataset.smi', tag_torsion_atoms=False)

davidlmobley commented 5 years ago

On the whole this looks excellent, @jchodera . This would make it so I'd actually construct datasets myself when on a time crunch, rather than just putting the source material there and wondering what should happen next. :)

Should there be something which allows toggling of obtaining/storing Wiberg bond orders?
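Such a toggle could presumably just be another factory field. A minimal sketch using a stand-in class (the real factory and the flag name are hypothetical, not existing API):

```python
from dataclasses import dataclass

# Stand-in for the proposed factory; `store_wiberg_bond_orders` is the
# hypothetical toggle suggested in this comment, not an existing option.
@dataclass
class OptimizationDatasetFactory:
    max_conformers: int = 20
    compute_hessians: bool = False
    store_wiberg_bond_orders: bool = False

factory = OptimizationDatasetFactory()
factory.store_wiberg_bond_orders = True  # request Wiberg bond orders be computed/stored
```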

In terms of naming, I like all of the options you proposed:

What do you think about qcfractal-submit, qcfractaltools, or openforcefield-qcfractal? One possibility would be to use namespace-style packages and have this install openforcefield.qcfractaltools or some other subpackage within the openforcefield namespace.

I do think we want to avoid the tendency to create a new repo for every new thing, because every repo adds some maintenance burden. That said, this doesn't seem to be something that already has a home, so a new repo may be a good idea.

yudongqiu commented 5 years ago

@jchodera @davidlmobley I totally agree on the general plan, which contains two steps:

  1. Modularize and standardize the functionalities into a library. Provide a universal interface.

  2. Seek approval from @dgasmith for authorization to submit and run calculations on the public QCArchive server (under the "openff" tag).

For the past few days, I've been carefully thinking about the design of this openforcefield-qcfractal library. I tried to understand the use cases of this library, and realized three main difficulties that prevented us from implementing this in the first place:

  1. For various dataset submissions, customized features are often needed to deal with edge cases and unexpected science from the input data, and it's very hard to predict all such needs. We don't want to force users to edit the source code and merge a PR before each dataset can be submitted.

  2. There is potentially a huge waste of resources if a submission is made with bugs or through misuse of the library.

  3. Certain submissions depend on the completion of others. For example, we want to submit a hessian dataset only after the optimizations are done. (Handling this might be a desirable feature for QCPortal, but for now we have to deal with it ourselves.)
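The dependency described in point 3 could be handled with a simple polling helper. A minimal sketch, where `get_optimization_status` and `submit_hessian_dataset` are stand-ins for whatever qcportal-based helpers the library would provide (neither is an existing API):

```python
import time

def submit_when_ready(dataset_name, get_optimization_status, submit_hessian_dataset,
                      poll_interval=3600, max_polls=None):
    """Poll until every optimization in `dataset_name` is COMPLETE, then submit hessians.

    `get_optimization_status(name)` is assumed to return a dict mapping entry
    labels to status strings, e.g. {"mol1": "COMPLETE", "mol2": "RUNNING"}.
    """
    polls = 0
    while True:
        statuses = get_optimization_status(dataset_name)
        if all(status == "COMPLETE" for status in statuses.values()):
            return submit_hessian_dataset(dataset_name)
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"{dataset_name}: optimizations still incomplete")
        time.sleep(poll_interval)
```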

To address the above difficulties, I think it's necessary to expose a high-level structure of the submission procedure to the user. One way is to provide the user with a heavily commented, clearly structured script, which has several benefits:

  1. New users reading this script will get a general idea of the submission procedure. This lowers the learning curve and reduces the chance of mistakes.

  2. The script contains standardized features with default configurations, while also supporting the addition of specialized features. If a feature is widely useful, it will still be easy to contribute it to the library in a PR, with no need to wait for the merge.

  3. As the submission procedure involves interfacing with multiple packages (fragmenter, cmiles, geomeTRIC, torsiondrive, qcportal, etc.), the detailed configuration of each interface can be fine-tuned in the script. I think this is actually friendlier than a long list of optional arguments.

  4. The script can be executed at a chosen time. For example, the user can run the hessian submission script after the optimization dataset finishes computing.

  5. The script can be used to launch "test submissions" on local test servers, greatly reducing mistakes and bugs in the final submission. If needed, it is even possible to automatically spin up a snowflake QCFractal server and test the submission before sending it to the public server.

  6. When tests fail or produce unexpected data, the script provides a clear entry point for debugging and fixing the issues.

  7. The script ensures maximum reproducibility of each submission, while also serving as documentation of how the submission was done. Such documentation is especially important when customized features are used.

  8. The script is also highly reusable for similar submissions that use the same customized features.

Here is my proposed design, from a user's perspective:

Step 1: Install the library:

  conda install -c conda-forge openforcefield-qcfractal

Step 2: Use provided tool to generate a submission script:

  openff-qcf-submit.py --type optimization

This command will print a brief message about which features are selected by default, such as:

  Generating submit_optimization.py script
  1. <fragmenter> is selected as default conformer generation tool
     Max 10 conformers per molecule
  2. Default filters selected: [ "like-drug", "less than 50 heavy atoms" ].
  3. <geomeTRIC> is selected as the default optimizer.
  4. B3LYP-D3(bj) / DZVP is selected as the default QM method.

Step 3: Open and edit the generated "submit_optimization.py" script file.

Step 4: Run the script to test submission.

  python submit_optimization.py input_smiles.smi

Step 5: If everything is correct, run the final submission to the production server:

  python submit_optimization.py input_smiles.smi -c qcf_public.yaml

An example "submit_optimization.py" script I wrote is pasted below.

For experienced users who want maximum convenience, steps 2-5 can be combined into a single command. New users are strongly encouraged to follow these steps for each new type of submission. If reproducibility is needed, the user should upload the submission script together with the input file to a repository.

yudongqiu commented 5 years ago

Here is an example of the generated submit_optimization.py script.

#!/usr/bin/env python

"""
OpenForceField QCFractal Submission Script
Type: OptimizationDataset
Name: OpenFF test set 1
Date: 09-09-2019
"""

import openforcefield_qcfractal as offqcf  # hyphens are not valid in Python module names

def step1_expand_conformers(input_smiles):
    """
    expand conformers for input list of molecules

    Parameters
    ----------
    input_smiles: List[str]
        A list of smiles string each corresponding to one input molecule

    Returns
    -------
    molecules_data: Dict[str, Dict[str, any]]
        The dictionary maps the label of a molecule to data. e.g.
        {
            molecule_label1: {
                'conformers': [Conformer1, Conformer2, ..],
                'attributes': {
                    'canonical_isomeric_smiles': ...,
                },
            },
            molecule_label2: ...,
        }

    Notes
    -----
    1. The "label" will be used as index for the entry in the resulting QCArchive dataset
    2. The "attributes" will be copied into each entry in the dataset, the content is optional
    """
    molecules_data = {}
    for smile in input_smiles:
        # Expand protonation states and stereoisomers, default using fragmenter
        molecule_states = offqcf.expand_states(smile)
        # Each state is assigned as a new molecule
        for molecule in molecule_states:
            molecule_attributes = offqcf.get_molecule_attributes(molecule)
            # the "canonical_isomeric_smiles" is used as label here
            molecule_label = molecule_attributes['canonical_isomeric_smiles']
            molecule_conformers = offqcf.generate_conformers(molecule, max=10)
            molecules_data[molecule_label] = {
                'conformers': molecule_conformers,
                'attributes': molecule_attributes,
            }
    return molecules_data

def step2_filter_molecules(molecules_data):
    """
    Filter out unwanted molecules

    Parameters
    ----------
    molecules_data: Dict[str, Dict[str, any]]
        output of step 1

    Returns
    -------
    filtered_molecules_data: Dict[str, Dict[str, any]]
        Same format as input, with filtered data

    """
    # run filter to keep molecules <= 50 heavy atoms
    filtered_molecules_data = offqcf.filter_50_heavy(molecules_data)
    # run filter to keep only drug-like molecules
    filtered_molecules_data = offqcf.filter_drug_like(filtered_molecules_data)
    # customized filters can be implemented here
    return filtered_molecules_data

def step3_submit(molecules_data, client_config=None):
    """
    Submit the Optimization Dataset

    Parameters
    ----------
    molecules_data: Dict[str, Dict[str, any]]
        output of step 2, same format as output of step 1
    client_config: str, optional
        path to a client config file; None runs a test submission
    """

    dataset_name = 'OpenFF test set 1'
    # geometric settings
    geometric_spec = {
        "coordsys": "dlc", # prevent translation & rotation moves
        "qccnv": True, # less tight convergence criteria
    }
    # qm specs
    qm_spec = {
        "method": "B3LYP-d3bj",
        "basis": "dzvp",
    }
    # create a client (client_config=None will do test submission, replace with a real config to submit to public server)
    client = offqcf.create_client(client_config=client_config)
    # run the submission
    offqcf.create_optimization_dataset(client, dataset_name, molecules_data, geometric_spec, qm_spec, start_compute=True)

def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('input_smi', help='Input .smi file containing a list of SMILES strings')
    parser.add_argument('-c', '--client_config', help='public server client config file')
    args = parser.parse_args()

    with open(args.input_smi) as f:
        # strip newlines and skip blank lines
        input_smiles = [line.strip() for line in f if line.strip()]

    molecules_data = step1_expand_conformers(input_smiles)

    filtered_molecules_data = step2_filter_molecules(molecules_data)

    step3_submit(filtered_molecules_data, client_config=args.client_config)

if __name__ == '__main__':
    main()

davidlmobley commented 5 years ago

The other nice thing about this is that it would (eventually, if our sponsors want to put developer time into it) allow us to make this process relatively agnostic as to cheminformatics toolkit, whereas right now our prep steps rely heavily on one specific toolkit.

dgasmith commented 5 years ago

One item that may also help, which is (slowly) coming, is a time estimator, so you can see whether you have requested 10M core hours vs 10k core hours. Quantum chemistry is quite opaque in this regard.
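Until such an estimator exists, a back-of-envelope calculation gives a rough sense of scale. All constants below are illustrative placeholders, not QCArchive's actual cost model:

```python
def estimate_core_hours(n_molecules, conformers_per_molecule=10,
                        opt_steps=30, core_hours_per_gradient=0.25):
    """Crude estimate: molecules x conformers x optimization steps x per-gradient cost.

    The defaults are made-up placeholders; real costs vary enormously with
    method, basis, and molecule size.
    """
    return n_molecules * conformers_per_molecule * opt_steps * core_hours_per_gradient

# e.g. a 1000-molecule optimization dataset at these placeholder rates
print(estimate_core_hours(1000))  # → 75000.0
```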