openforcefield / qca-dataset-submission

Data generation and submission scripts for the QCArchive ecosystem.

OpenFF QCArchive Dataset Submission

Dataset Lifecycle

All datasets submitted to QCArchive via this repository conform to the Dataset Lifecycle.

See STANDARDS.md for submission standards. Datasets must be submitted as pull requests.

User Quickstart

  1. Ensure git-lfs is installed on your local machine: https://git-lfs.github.com/

  2. To submit a new dataset, begin by cloning this repository:

    export GIT_LFS_SKIP_SMUDGE=1
    git clone git@github.com:openforcefield/qca-dataset-submission.git

    This clones the repo without downloading existing LFS objects. To download all LFS objects, omit the export GIT_LFS_SKIP_SMUDGE=1 line.

  3. Once cloned, create and switch to a new branch from master, then create a new directory in qca-dataset-submission/submissions/:

    git checkout -b <dataset-branch>
    mkdir qca-dataset-submission/submissions/YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0

    You will add all submission artifacts to this directory.

  4. Create and activate a new conda env with basic submission-preparation requirements with:

    conda env create -f qca-dataset-submission/devtools/prod-envs/qcarchive-user-submit.yaml
    conda activate qcarchive-user-submit
  5. Choose a starting notebook and README based on the type of dataset you wish to submit, and copy them into the directory you created:

    cp examples/<dataset-type>/* qca-dataset-submission/submissions/YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0
  6. Start up a Jupyter notebook with your new notebook:

    jupyter notebook qca-dataset-submission/submissions/YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0/generate-dataset.ipynb

    Edit the contents with appropriate metadata information, read in your molecules using the cells appropriate for your input data, and make any other modifications as needed for your specific needs.

  7. Copy generated metadata components into README. Write a reasonably-detailed high-level description of the submission at the top.

  8. Commit the following files in the submission directory you made:

    • your input files; please compress them if possible with e.g. bzip2
    • generate-dataset.ipynb
    • dataset.pdf
    • dataset.smi
    • dataset.json.bz2
  9. Push your branch to GitHub:

    git push origin <dataset-branch>
  10. Make a new PR for the branch. Validation will run automatically on your dataset.json.* file, indicating any potential issues prior to submission. Ask for help if you see validation failures you do not understand. Ping a reviewer in the comments.

  11. Once reviewed and approved, your submission will be merged and submitted to QCArchive! Computations specified by the submission will be performed on OpenFF-managed compute resources.
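Before committing (step 8), a quick local check can confirm that the submission directory follows the naming pattern from step 3 and contains the expected files. The following is a minimal sketch and not part of this repository's tooling: the names `submission_dir_name`, `missing_artifacts`, `REQUIRED_FILES`, and `NAME_PATTERN` are hypothetical, and the regular expression is only an approximation of the `YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-v1.0` convention.

```python
import re
from datetime import date
from pathlib import Path

# Files listed in step 8. Input files vary per submission and are not checked here.
REQUIRED_FILES = ("generate-dataset.ipynb", "dataset.pdf", "dataset.smi", "dataset.json.bz2")

# Approximates YYYY-MM-DD-OpenFF-<DESCRIPTIVE-DATASET-NAME>-vX.Y from step 3 (assumption).
NAME_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}-OpenFF-[A-Za-z0-9-]+-v\d+\.\d+$")


def submission_dir_name(descriptive_name: str, day: date, version: str = "1.0") -> str:
    """Build a directory name that follows the pattern shown in step 3."""
    slug = "-".join(descriptive_name.split())  # replace whitespace with hyphens
    return f"{day.isoformat()}-OpenFF-{slug}-v{version}"


def missing_artifacts(submission_dir: Path) -> list[str]:
    """Return the step-8 files that are absent from the submission directory."""
    return [name for name in REQUIRED_FILES if not (submission_dir / name).is_file()]
```

For example, `submission_dir_name("Iodine Fragment Opt", date(2024, 9, 10))` produces a name in the same shape as the existing folders under `submissions/`, and `missing_artifacts(Path(...))` lists anything still to be generated by the notebook.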

Creating a compute expansion

If you have already computed a dataset but want to re-compute it with a new QCSpec (e.g. a new level of theory), you can do so using a compute expansion. This is faster than creating a new dataset, and explicitly links datasets with the same molecules and purpose. A compute expansion involves adding a file called compute.json to your original submission, which contains the dataset metadata (identical to the original dataset) and the new compute spec. This can be done manually or programmatically. The programmatic approach is described below, with an example of the notebook and of the file.

  1. Create a new branch as described above, and navigate to the submission directory of the dataset you want to expand.
  2. Create a new Jupyter notebook called generate-compute.ipynb (example here).
  3. In the notebook, either download the original dataset and remove the molecules and original QCSpec, or re-create the dataset with the same metadata as the original (e.g. same name, description, etc) and skip the molecule addition step.
    • Please note that the default compute_tag is openff; if you need to use a different one, please add it explicitly to the dataset at this step, as the compute.json file overrides the compute tag added manually to the PR. If you do need to change the compute tag after submission, you can change it by updating the label on the PR and the change will take effect when the error cycling action runs next.
  4. Add the new QCSpec to the dataset and save the dataset to compute.json (example here).
  5. Add the additional compute spec to the submission's README.md file.
  6. Add the generate-compute.ipynb and compute.json files to the submission's QCSubmit Manifest entry in the README.md file.
  7. Proof the submission and open a PR. Dataset validation will run automatically.
  8. Once the dataset is validated, request a review, and once approved, your compute expansion will be submitted!
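The relationship between the original dataset and compute.json (same metadata, no molecules, only the new spec) can be illustrated with plain dictionaries. This is a sketch of the idea only, not how the file is actually produced: in practice the notebook would generate it. The function `make_compute_expansion` is hypothetical, and the key names `dataset`, `qc_specifications`, `dataset_name`, and `compute_tag` are assumptions that should be checked against a real exported dataset.

```python
import copy
import json


def make_compute_expansion(original: dict, new_spec_name: str, new_spec: dict) -> dict:
    """Copy the dataset metadata, drop the molecules, and keep only the new spec.

    Key names ("dataset", "qc_specifications", "compute_tag") are assumed for
    illustration and may not match the real file layout.
    """
    expansion = copy.deepcopy(original)  # leave the original untouched
    expansion["dataset"] = {}  # skip the molecule entries
    expansion["qc_specifications"] = {new_spec_name: new_spec}  # replace the original QCSpec
    expansion.setdefault("compute_tag", "openff")  # default tag named in step 3
    return expansion


# Hypothetical metadata standing in for an existing submission.
original = {
    "dataset_name": "Example Optimization Dataset v1.0",
    "description": "Illustrative metadata only.",
    "compute_tag": "openff",
    "qc_specifications": {
        "default": {"method": "b3lyp-d3bj", "basis": "dzvp", "program": "psi4"}
    },
    "dataset": {"mol-1": {"placeholder": True}},
}

expansion = make_compute_expansion(
    original, "hf-6-31gs", {"method": "hf", "basis": "6-31G*", "program": "psi4"}
)
print(json.dumps(expansion, indent=2))
```

Because the metadata is copied rather than retyped, the name and description stay identical to the original, which is what links the expansion to the existing dataset.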

The Lifecycle of a Dataset Submission

All Open Force Field datasets submitted to QCArchive undergo a well-defined lifecycle.

Figure: Dataset Lifecycle diagram.

Each labeled rectangle in the lifecycle represents a state. A submission PR changes state according to the arrows. Changes in state may be performed by automation or manually by a human when certain criteria are met.

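The lifecycle process is described below, with [bracketed] items indicating the agent of action, one of [GHA] (GitHub Actions automation), [Board] (the Dataset Tracking board), or [Human] (a person acting manually):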

  1. A PR is created against qca-dataset-submission by a submitter.

    • the submitter fills out the informational sections of the PR template
    • [GHA] validation operates on all dataset*.json files found in the PR; performs validation checks
      • comment made based on validation checks
      • reruns on each push
  2. A card for the PR is added to the Dataset Tracking board.

  3. When the submission is ready to be submitted to public QCArchive (validations pass, submitters and reviewers satisfied), PR is merged.

    • [Board] PR card will move to state "Queued for Submission" immediately.

    • [GHA] lifecycle-backlog will move PR card to state "Queued for Submission" if merged and in state "Backlog"

    • [GHA] lifecycle-submission will attempt to submit the dataset

      • if successful, will move card to state "Error Cycling"; add comment to PR
      • if failed, will keep card queued; add comment to PR; attempt again next execution
    • [Human] Submit worker jobs on a server to begin compute. If using Nautilus, carefully monitor utilization and scale down resources as jobs finish.

  4. Counts of COMPLETE, INCOMPLETE, and ERROR are reported for Optimizations and TorsionDrives.

  5. PR will remain in state "Error Cycling" until moved to "Requires Scientific Review" or until all tasks COMPLETE

    • [Human] if errors appear persistent, move to state "Requires Scientific Review"
    • discussion should be had on PR for next version
    • [Human] once decided, state moved to "End of Life"
    • [Human] ensure all worker jobs have been shut down.
  6. [GHA] lifecycle-end-of-life will add tag 'end-of-life' to dataset in QCArchive for PR in "End of Life"

  7. [GHA] lifecycle-archived-complete will add tag 'archived-complete' to dataset in QCArchive for PR in "Archived/Complete"
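The states and moves described above can be read as a small state machine. The sketch below is one interpretation of the numbered steps, not code from the repository's automation: the `State` enum, the `TRANSITIONS` table, and `move` are hypothetical names, and the move from "Error Cycling" to "Archived/Complete" is inferred from steps 5 and 7 rather than stated outright.

```python
from enum import Enum


class State(Enum):
    BACKLOG = "Backlog"
    QUEUED = "Queued for Submission"
    ERROR_CYCLING = "Error Cycling"
    SCIENTIFIC_REVIEW = "Requires Scientific Review"
    END_OF_LIFE = "End of Life"
    ARCHIVED_COMPLETE = "Archived/Complete"


# Allowed moves, as read from the numbered steps (an interpretation, see lead-in).
TRANSITIONS = {
    State.BACKLOG: {State.QUEUED},  # PR merged
    State.QUEUED: {State.ERROR_CYCLING},  # submission succeeded
    State.ERROR_CYCLING: {State.SCIENTIFIC_REVIEW, State.ARCHIVED_COMPLETE},
    State.SCIENTIFIC_REVIEW: {State.END_OF_LIFE},  # decided by a human
    State.END_OF_LIFE: set(),
    State.ARCHIVED_COMPLETE: set(),
}


def move(current: State, target: State) -> State:
    """Return the target state if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value!r} to {target.value!r}")
    return target
```

A failed submission attempt keeps the card in "Queued for Submission", so it needs no entry in the table: the state simply does not change until the next execution succeeds.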

Management Touchpoints

In addition to the states given above, there are additional touchpoints available for managing dataset submissions:

  1. The tracking label is the "on/off" switch for automation via GitHub Actions. To disable all automation on a submission PR, remove this label. To enable automation, add the label.

  2. Submission priority can be changed by adding one of the following labels:

    • priority-high: highest priority
    • priority-normal: normal priority
    • priority-low: lowest priority
  3. Submission routing to QCFractal managers on different compute resources can be accomplished with compute tags. Add a label like compute-<tagname> to set the compute tag for all QCArchive tasks associated with a submission. Be sure to coordinate with QCFractal manager admins to ensure your chosen compute tag is being served on the expected resources. This mechanism can also be used to "dead-letter" computations that are no longer desired by setting a compute tag that no manager will service.

  4. The order of a submission PR in a Dataset Tracking column matters. Submissions higher in a column will be operated on first by all GitHub Actions automation. For example, if you want to error cycle a submission before any others so it has a higher chance of being pulled by idle manager workers, place it at the top of the Error Cycling column.
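As an illustration of how the labels above combine, the following sketch turns a PR's label list into the three controls described: whether automation is on, the priority, and the compute tag. It is not the repository's implementation. `Controls` and `parse_labels` are hypothetical names, the fallback tag `openff` is taken from the compute expansion notes, and the priority is left as `None` when no priority label is present because the default is not stated here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Label names from touchpoint 2.
PRIORITIES = {"priority-high": "high", "priority-normal": "normal", "priority-low": "low"}
COMPUTE_PREFIX = "compute-"


@dataclass
class Controls:
    tracked: bool  # True when the "tracking" label is present
    priority: Optional[str]  # None when no priority label is set
    compute_tag: str


def parse_labels(labels: Iterable[str], default_tag: str = "openff") -> Controls:
    """Derive automation controls from a submission PR's labels."""
    labels = list(labels)
    tracked = "tracking" in labels
    priority = next((PRIORITIES[name] for name in labels if name in PRIORITIES), None)
    compute_tag = next(
        (name[len(COMPUTE_PREFIX):] for name in labels if name.startswith(COMPUTE_PREFIX)),
        default_tag,
    )
    return Controls(tracked, priority, compute_tag)
```

For instance, `parse_labels(["tracking", "priority-high", "compute-mytag"])` (with `mytag` a made-up tag name) yields a tracked submission with high priority routed to `mytag`, while an empty label list yields an untracked submission on the default tag.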

Dude where's my Dataset?

Finding the source of a dataset in QCArchive can be difficult. Here we offer a mapping between each dataset in QCArchive and the folder that contains its inputs, along with a quick overview of some metadata and the status of the dataset. Note that new datasets submitted using QCSubmit know where they were created and have a long_description_url in the metadata which points directly to their home folder in this repository.

Status

The status only refers to the default specification which is required for all of our datasets. Currently this is B3LYP-D3BJ/DZVP.

Key:

Complete: 100% of the default-spec jobs have completed.

Error: some of the jobs in the dataset contain errors that may prevent them from finishing; an example is a linear torsiondrive.

Running: the dataset is currently running and may have some incomplete jobs.
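The key can be expressed as a small function over job counts for the default specification. This is a sketch under assumptions: `dataset_status` is a hypothetical helper, and letting "Error" take precedence over "Running" when both errored and incomplete jobs exist is a choice made here, not something the key specifies.

```python
def dataset_status(complete: int, incomplete: int, error: int) -> str:
    """Map default-spec job counts to a status label from the key above."""
    if complete + incomplete + error == 0:
        raise ValueError("dataset has no jobs for the default specification")
    if error > 0:
        return "Error"  # assumed to take precedence over "Running"
    if incomplete > 0:
        return "Running"
    return "Complete"  # every job finished
```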

Basic Datasets

These are currently used to compute properties of a minimum energy conformation (Hessians, wavefunctions, etc.), usually derived from completed optimization datasets.

| QCArchive Dataset | Folder | Description | Elements | Status |
| --- | --- | --- | --- | --- |
| OpenFF Optimization Set 1 | 2019-07-09-OpenFF-Optimization-Set | Hessian calculations. | Cl, S, C, F, O, H, N | Complete |
| OpenFF NCI250K Boron 1 | 2019-07-05 OpenFF NCI250K Boron 1 | Hessian calculations. | Cl, Br, S, C, F, B, O, H, N | Complete |
| OpenFF Discrepancy Benchmark 1 | 2019-07-05 eMolecules force field discrepancies 1 | Hessian calculation. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| OpenFF Gen 2 Opt Set 1 Roche | 2020-03-20-OpenFF-Gen-2-Optimization-Set-1-Roche | Hessian calculation. | Cl, S, C, F, O, H, N | Complete |
| OpenFF Gen 2 Opt Set 2 Coverage | 2020-03-20-OpenFF-Gen-2-Optimization-Set-2-Coverage | The hessian calculations. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy | 2020-03-20-OpenFF-Gen-2-Optimization-Set-3-Pfizer-Discrepancy | Hessian calculations. | Cl, F, C, S, O, H, N | Complete |
| OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy | 2020-03-20-OpenFF-Gen-2-Optimization-Set-4-eMolecules-Discrepancy | Hessian calculations. | Cl, Br, S, C, F, P, I, O, H, N | Complete |
| OpenFF Gen 2 Opt Set 5 Bayer | 2020-03-20-OpenFF-Gen-2-Optimization-Set-5-Bayer | Hessian calculations. | Si, Cl, Br, F, C, S, O, H, N | Error |
| OpenFF VEHICLe Set 1 | 2019-07-02 VEHICLe optimization dataset | Hessian calculations. | S, C, O, H, N | Error |
| SMIRNOFF Coverage Set 1 | 2019-06-25-smirnoff99Frost-coverage | Hessian calculations. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| OpenFF ESP Fragment Conformers v1.0 | 2022-01-16-OpenFF-ESP-Fragment-Conformers-v1.0 | ESP Calculations | N, Cl, C, H, P, Br, O, F, S | Running |
| OpenFF Theory Benchmarking Single Point Energies v1.0 | 2021-09-06-theory-bm-single-points | Single Point Energy dataset for the final optimized geometries from MP2/heavy-aug-cc-pVTZ torsiondrives. | Cl, F, C, S, O, H, N, P | Running |
| TorsionNet500 Single Points Dataset v1.0 | 2021-11-09-TorsionNet500-single-points | Single point energies of final geometries of TorsionNet500 dataset. | H, O, F, S, N, Cl, C | Running |
| SPICE DES Monomers Single Points Dataset v1.1 | 2021-11-15-QMDataset-DES-monomers-single-points | Single point energy calculation of DES monomers. | I, C, Br, P, Cl, H, S, O, F, N | Complete |
| SPICE Solvated Amino Acids Single Points Dataset v1.1 | 2021-11-08-QMDataset-Solvated-Amino-Acids-single-points | Single point energy calculation of solvated amino acids. | N, S, O, C, H | Complete |
| SPICE DES370K Single Points Dataset v1.0 | 2021-11-08-QMDataset-DES370K-single-points | SPICE single point dataset for ML applications. | N, O, Mg, H, F, K, Br, Na, P, Cl, I, Ca, S, Li, C | Complete |
| SPICE DES370K Single Points Dataset Supplement v1.0 | 2022-02-18-QMDataset-DES370K-single-points-supplement | SPICE single point dataset for ML applications. | F, H, Cl, S, I, Br, N, Li, O, C, Na | Running |
| SPICE Dipeptides Single Points Dataset v1.2 | 2021-11-08-QMDataset-Dipeptide-single-points | SPICE single point dataset for ML applications. | C, N, O, H, S | Complete |
| SPICE PubChem Set 1 Single Points Dataset v1.2 | 2021-11-08-QMDataset-pubchem-set1-single-points | SPICE single point dataset for ML applications. | O, Cl, N, C, P, Br, S, F, I, H | Running |
| SPICE PubChem Set 2 Single Points Dataset v1.2 | 2021-11-09-QMDataset-pubchem-set2-single-points | SPICE single point dataset for ML applications. | H, P, C, Cl, Br, N, F, S, O, I | Running |
| SPICE PubChem Set 3 Single Points Dataset v1.2 | 2021-11-09-QMDataset-pubchem-set3-single-points | SPICE single point dataset for ML applications. | N, C, S, Cl, Br, F, P, I, H, O | Running |
| SPICE PubChem Set 4 Single Points Dataset v1.2 | 2021-11-09-QMDataset-pubchem-set4-single-points | SPICE single point dataset for ML applications. | N, S, Br, O, C, F, H, I, Cl, P | Running |
| SPICE PubChem Set 5 Single Points Dataset v1.2 | 2021-11-09-QMDataset-pubchem-set5-single-points | SPICE single point dataset for ML applications. | F, H, S, Br, Cl, N, P, C, I, O | Running |
| SPICE PubChem Set 6 Single Points Dataset v1.2 | 2021-11-09-QMDataset-pubchem-set6-single-points | SPICE single point dataset for ML applications. | Cl, O, N, H, C, P, S, F, Br, I | Running |
| OpenFF ESP Industry Benchmark Set v1.1 | 2022-02-02-OpenFF-ESP-Industry-Benchmark-Set-v1.1-single-point | HF/6-31G* conformers of public industry benchmark molecules. | N, F, Cl, C, H, O, Br, P, S | Running |
| SPICE Ion Pairs Single Points Dataset v1.1 | 2022-06-08-QMDataset-ion-pairs | SPICE single point dataset for ML applications. | F, Cl, Li, Na, Br, K, I | Running |
| RNA Single Point Dataset v1.0 | 2022-07-07-RNA-basepair-triplebase-single-points | RNA single point dataset consisting of RNA basepairs and triple bases. | P, N, O, C, H | Running |
| RNA Trinucleotide Single Point Dataset v1.0 | 2022-10-21-RNA-trinucleotide-single-points | Single point energy calculations of RNA basepairs and triple bases | O, N, C, H, P | Running |
| RNA Nucleoside Single Point Dataset v1.0 | 2023-03-09-RNA-nucleoside-single-points | Single point energy calculations of RNA nucleosides without O5' hydroxyl atom | O, N, C, H | Running |
| OpenFF multi-Br ESP Fragment Conformers v1.1 | 2023-11-30-OpenFF-multi-Br-ESP-Fragment-Conformers-v1.1-single-point | Single point ESP calculations | Br, C, F, H, N, O, P, S | |
| MLPepper RECAP Optimized Fragments v1.0 | 2024-07-26-MLPepper-RECAP-Optimized-Fragments-v1.0 | Single point property calculations for charge models | P, B, Cl, Br, C, H, I, F, O, N, Si, S | |
| OpenFF NAGL2 ESP Timing Benchmark v1.0 | 2024-09-06-OpenFF-NAGL2-ESP-Timing-Benchmark-v1.0 | Single point ESP calculations for timing/memory benchmarking | P, S, N, C, Cl, F, Br, O, H, I | |
| OpenFF NAGL2 ESP Timing Benchmark v1.1 | 2024-09-18-OpenFF-NAGL2-ESP-Timing-Benchmark-v1.1 | Single point ESP calculations for timing/memory benchmarking | P, S, N, C, Cl, F, Br, O, H, I | |
| OpenFF Sulfur Hessian Training Coverage Supplement v1.0 | 2024-09-18-OpenFF-Sulfur-Hessian-Training-Coverage-Supplement-v1.0 | Additional Hessian training data for Sage sulfur and phosphorus parameters (from 'OpenFF Sulfur Optimization Training Coverage Supplement v1.0') | O, S, C, Cl, P, N, F, Br, H | |
| OpenFF Aniline Para Hessian v1.0 | 2024-10-07-OpenFF-Aniline-Para-Hessian-v1.0 | Hessian single points for the final molecules in the OpenFF Aniline Para Opt v1.0 dataset | O, Cl, S, Br, H, F, N, C | |
| OpenFF Gen2 Hessian Dataset Protomers v1.0 | 2024-10-07-OpenFF-Gen2-Hessian-Dataset-Protomers-v1.0 | Hessian single points for the final molecules in the OpenFF Gen2 Optimization Dataset Protomers v1.0 dataset | H, C, Cl, P, F, Br, O, N, S | |
| MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0 | 2024-10-11-MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0 | Set of diverse iodine containing molecules with a number of calculated electrostatic properties. | Br, Cl, S, B, O, Si, C, N, I, P, H, F | |

Optimization Datasets

These are currently used to find a minimum energy conformation of a molecule.

| QCArchive Dataset | Folder | Description | Elements | Status |
| --- | --- | --- | --- | --- |
| OpenFF Optimization Set 1 | 2019-05-16-Roche-Optimization_Set | Geometry optimizations of a set of Roche molecules for forcefield fitting. | Cl, S, C, F, O, H, N | Complete |
| SMIRNOFF Coverage Set 1 | 2019-06-25-smirnoff99Frost-coverage | An optimization dataset that exercises all parameters in Smirnoff99Frost. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| OpenFF VEHICLe Set 1 | 2019-07-02 VEHICLe optimization dataset | VEHICLe (virtual exploratory heterocyclic library) dataset of 24,867 aromatic heterocyclic rings with expanded stereochemistry. | S, C, O, H, N | Error |
| OpenFF Discrepancy Benchmark 1 | 2019-07-05 eMolecules force field discrepancies 1 | A set of molecules whose optimized structures differ across forcefields. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| OpenFF NCI250K Boron 1 | 2019-07-05 OpenFF NCI250K Boron 1 | This database is a subset of boron-containing compounds from the NCI250K (Release 1 - Oct 1999) compound dataset. | Cl, Br, S, C, F, B, O, H, N | Complete |
| OpenFF Ehrman Informative Optimization v0.2 | 2019-09-06-OpenFF-Informative-Set | This provides an optimization dataset based on an initial batch of Jordan Ehrman's analysis of eMolecules, pulling out molecules with minimized geometries which are substantially different in different force fields. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| Pfizer discrepancy optimization dataset 1 | 2019-09-07-Pfizer-discrepancy-optimization-dataset-1 | This database is a subset of 100 challenging small molecule fragments where HF/minix followed by B3LYP/6-31G*//B3LYP/6-31G** differed substantially from OPLS3e. | Cl, F, C, S, O, H, N | Complete |
| FDA optimization dataset 1 | 2019-09-08-fda-optimization-dataset-1 | The ZINC15 FDA dataset was retrieved in mol2 format on Sun Sep 8 20:44:34 EDT 2019 via: http://zinc.docking.org/substances/subsets/fda.mol2?count=all | Cl, Br, F, C, S, P, I, O, H, N | Error |
| Kinase Inhibitors: WBO Distributions | 2019-11-27-kinase-inhibitor-optimization | Geometry optimization of kinase inhibitor conformers to explore WBO conformation dependency. | Cl, Br, S, C, F, P, I, O, H, N | Complete |
| OpenFF Gen 2 Opt Set 1 Roche | 2020-03-20-OpenFF-Gen-2-Optimization-Set-1-Roche | 2nd generation optimization dataset for bond and valence parameter fitting. | Cl, S, C, F, O, H, N | Complete |
| OpenFF Gen 2 Opt Set 2 Coverage | 2020-03-20-OpenFF-Gen-2-Optimization-Set-2-Coverage | 2nd generation optimization dataset for bond and valence parameter fitting. | Cl, Br, S, C, F, P, I, O, H, N | Error |
| OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy | 2020-03-20-OpenFF-Gen-2-Optimization-Set-3-Pfizer-Discrepancy | 2nd generation optimization dataset for bond and valence parameter fitting. | Cl, F, C, S, O, H, N | Complete |
| OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy | 2020-03-20-OpenFF-Gen-2-Optimization-Set-4-eMolecules-Discrepancy | 2nd generation optimization dataset for bond and valence parameter fitting | Cl, Br, S, C, F, P, I, O, H, N | Complete |
| OpenFF Gen 2 Opt Set 5 Bayer | 2020-03-20-OpenFF-Gen-2-Optimization-Set-5-Bayer | 2nd generation optimization dataset for bond and valence parameter fitting. | Si, Cl, Br, F, C, S, O, H, N | Error |
| OpenFF Protein Fragments v1.0 | 2020-07-06-OpenFF-Protein-Fragments-Initial | This is the initial test of running constrained optimizations on various protein fragments prepared by David Cerutti. Here we just have ALA as the central residue. | H, C, O, N | Complete |
| OpenFF Protein Fragments v2.0 | 2020-08-12-OpenFF-Protein-Fragments-version2 | This is the full protein fragment dataset (version2) consisting of constrained optimizations on various protein fragments prepared by David Cerutti. We have 12 central residues which are capped with a combination of different terminal residues. | S, C, O, H, N | Error |
| OpenFF Sandbox CHO PhAlkEthOH v1.0 | 2020-09-18-OpenFF-Sandbox-CHO-PhAlkEthOH | The molecules are from the AlkEthOH and PhEthOH datasets originally used to build the smirnoff99Frosst parameters. The AlkEthOH was taken from here | H, C, O | Running |
| OpenFF Industry Benchmark Season 1 v1.0 | 2021-03-30-OpenFF-Industry-Benchmark-Season-1-v1.0 | The combination of all publicly chosen compound sets by industry partners from the OpenFF season 1 industry benchmark | N, F, Cl, C, H, O, Br, P, S | Error |
| OpenFF Industry Benchmark Season 1 v1.1 | 2021-06-04-OpenFF-Industry-Benchmark-Season-1-v1.1 | The combination of all publicly chosen compound sets by industry partners from the OpenFF season 1 industry benchmark | N, F, Cl, C, H, O, Br, P, S | Running |
| OpenFF Theory Benchmarking Constrained Optimization Set MP2 heavy-aug-cc-pVTZ v1.1 | 2020-11-25-theory-bm-set-mp2-heavy-aug-cc-pvtz | This is a Constrained Optimization dataset for benchmarking MP2/heavy-aug-cc-pVTZ. | | Running |
| OpenFF Industry Benchmark Season 1 - MM v1.1 | 2021-07-28-OpenFF-Industry-Benchmark-Season-1-MM-v1.1 | The combination of all publicly chosen compound sets by industry partners from the OpenFF season 1 industry benchmark; MM computations starting from QM-optimized geometries. | N, F, Cl, C, H, O, Br, P, S | Running |
| OpenFF RESP Polarizability Optimizations v1.0 | 2021-10-01-OpenFF-resppol-mp2-single-point | A data set used for training ESP-fitting based typed atomic polarizabilities with a direct approximation. | N, C, H, O | Running |
| OpenFF RESP Polarizability Optimizations v1.1 | 2021-10-01-OpenFF-resppol-mp2-single-point | A data set used for training ESP-fitting based typed atomic polarizabilities with a direct approximation. | N, C, H, O | Running |
| SPICE Dipeptides Optimization Dataset v1.0 | 2021-11-11-Dipeptide-optimization-set | Optimization set created from the smiles of SPICE Dipeptide dataset. | N, C, H, O, S | Running |
| OpenFF Gen 2 Optimization Dataset Protomers v1.0 | 2021-12-21-OpenFF-Gen2-Optimization-Set-Protomers | Optimization set created from the smiles of missing protomers in Gen 2 optimization sets. | O, F, S, Br, Cl, C, P, H, I, N | Running |
| OpenFF ESP Industry Benchmark Set v1.0 | 2022-02-02-OpenFF-ESP-Industry-Benchmark-Set-v1.0-optimization-set | HF/6-31G* conformers of public industry benchmark molecules. | N, F, Cl, C, H, O, Br, P, S | Running |
| OpenFF Protein Capped 1-mers 3-mers Optimization Dataset v1.0 | 2022-05-30-OpenFF-Protein-Capped-1-mers-3-mers-Optimization | Optimization dataset for protein capped 1-mers Ace-X-Nme and capped 3-mers Ace-Y-X-Y-Nme with Y = {Ala, Val} and X = 26 canonical amino acids with common protomers/tautomers (Ash, Cyx, Glh, Hid, Hip, and Lyn) | H, C, N, O, S | |
| OpenFF Iodine Chemistry Optimization Dataset v1.0 | 2022-07-27-OpenFF-iodine-optimization-set | Optimization set created from Gen1 and Gen2 molecules containing iodine | C, F, O, H, Br, Cl, N, I, S | |
| OpenFF multi-Br ESP Fragment Conformers v1.0 | 2023-11-02-OpenFF-multi-Br-ESP-Fragment-Conformers-v1.0 | Optimization set created from 2022-01-16-OpenFF-ESP-Fragment-Conformers-v1.0 by selecting molecules with multiple Cl atoms and replacing them with Br | Br, C, F, H, N, O, P, S | |
| XtalPi Shared Fragments OptimizationDataset v1.0 | 2024-01-30-xtalpi-shared-fragments-optimization-v1.0 | Representative optimization molecules used to fit XFF | C, H, Cl, Br, S, O, F, N, P | |
| XtalPi 20-percent Fragments OptimizationDataset v1.0 | 2024-04-02-xtalpi-20-percent-fragments-optimization-v1.0 | Larger (20%) representative subset of molecules used to fit XFF | Cl, P, Br, I, H, C, B, Si, O, N, F, S | |
| OpenFF Torsion Benchmark Supplement Optimization Dataset v1.0 | 2024-04-18-OpenFF-Torsion-Benchmark-Supplement-Optimization-Dataset-v1.0 | Additional optimizations for benchmarking Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work | H, C, N, O, F, P, S, Cl, Br | |
| OpenFF Torsion Multiplicity Optimization Training Coverage Supplement v1.0 | 2024-06-20-OpenFF-Torsion-Multiplicity-Optimization-Training-Coverage-Supplement-v1.0 | Additional optimization training data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work | C, Cl, S, O, H, P, N, Br | |
| OpenFF Torsion Multiplicity Optimization Benchmarking Coverage Supplement v1.0 | 2024-06-24-OpenFF-Torsion-Multiplicity-Optimization-Benchmarking-Coverage-Supplement-v1.0 | Additional optimization benchmarking data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work | Cl, H, I, S, O, N, Br, C, P | |
| OpenFF Iodine Fragment Opt v1.0 | 2024-09-10-OpenFF-Iodine-Fragment-Opt-v1.0 | B3LYP-D3BJ/DZVP optimized conformers for a variety of I-containing fragment molecules | C, O, I, S, F, Br, Cl, N, H | |
| OpenFF Sulfur Optimization Training Coverage Supplement v1.0 | 2024-09-11-OpenFF-Sulfur-Optimization-Training-Coverage-Supplement-v1.0 | Additional optimization training data for Sage sulfur and phosphorus parameters | C, S, F, O, H, Cl, Br, P, N | |
| OpenFF Sulfur Optimization Benchmarking Coverage Supplement v1.0 | 2024-09-18-OpenFF-Sulfur-Optimization-Benchmarking-Coverage-Supplement-v1.0 | Additional optimization benchmarking data for Sage sulfur and phosphorus parameters | S, P, Cl, C, N, O, H, Br, F | |
| OpenFF Lipid Optimization Training Supplement v1.0 | 2024-10-08-OpenFF-Lipid-Optimization-Training-Supplement-v1.0 | Additional optimization training data for Sage from representative LIPID MAPS fragments | I, Br, O, H, P, C, N, Cl, F, S | |

TorsionDrive Datasets

These are currently used to perform a complete rotation of one or more selected bonds, where optimizations are performed over a discrete set of angles.

QCArchive Dataset Folder Description Elements Status
Fragment Stability Benchmark 2019-03-06-Fragmenter_Stability-Benchmark Examination of different fragmentation schemes. Cl, F, C, P, I, O, H, N Error
OpenFF Group1 Torsions 2019-05-01-OpenFF-Group1-Torsions A collection of torsion drives for forcefield fitting. Cl, F, C, S, O, H, N Error
SMIRNOFF Coverage Torsion Set 1 2019-07-01-smirnoff99Frost-coverage-torsion Set of small molecules that use all smirnoff99Frost parameters. C', Br, S, C, F, P, I, O, H, N Error
OpenFF Substituted Phenyl Set 1 2019-07-25-phenyl-set A set of substituted phenyl torsiondrives. Cl, Br, F, C, I, O, H, N Error
Pfizer discrepancy torsion dataset 1 2019-09-07-Pfizer-discrepancy-torsion-dataset-1 This database is a subset of 100 challenging small molecule fragments where HF/minix followed by B3LYP/6-31G*//B3LYP/6-31G** differed substantially from OPLS3e. Cl, F, C, S, O, H, N Error
TorsionDrive Paper 2019-11-07-TorsionDrive-Paper Torsion Drives to explore wavefront propagation for the TorsionDrive paper. C, H, O Error
OpenFF Primary Benchmark 1 Torsion Set 2019-12-05-OpenFF-Benchmark-Primary-1-torsion Validation of optimized force field torsion parameters. Cl, Br, F, C, S, O, H, N Error
OpenFF Primary Benchmark 2 Torsion Set 2020-01-17-OpenFF-Benchmark-Full-1-torsion Validation of optimized force field torsion parameters. Cl, Br, S, C, F, P, I, O, H, N Error
OpenFF Group1 Torsions 2 2020-01-31-OpenFF-Group1-Torsions-2 Generation of additional data for fitting of newly added torsion terms. H, C, O, N Complete
OpenFF Group1 Torsions 3 2020-02-10-OpenFF-Group1-Torsions-3 Generation of additional data for fitting of t128 and t129 H, C, O, N Error
OpenFF Gen 2 Torsion Set 1 Roche 2020-03-12-OpenFF-Gen-2-Torsion-Set-1-Roche Design 2nd generation torsion dataset for valence parameter fitting. F, C, S, O, H, N Error
OpenFF Gen 2 Torsion Set 2 Coverage 2020-03-12-OpenFF-Gen-2-Torsion-Set-2-Coverage Design 2nd generation torsion dataset for valence parameter fitting. Cl, Br, F, C, S, P, I, O, H, N Error
OpenFF Gen 2 Torsion Set 3 Pfizer Discrepancy 2020-03-12-OpenFF-Gen-2-Torsion-Set-3-Pfizer-Discrepancy Design 2nd generation torsion dataset for valence parameter fitting S, C, F, O, H, N Running
OpenFF Gen 2 Torsion Set 4 eMolecules Discrepancy 2020-03-12-OpenFF-Gen-2-Torsion-Set-4-eMolecules-Discrepancy Design 2nd generation torsion dataset for valence parameter fitting. Cl, Br, F, C, S, P, I, O, H, N Error
OpenFF Gen 2 Torsion Set 5 Bayer 2020-03-12-OpenFF-Gen-2-Torsion-Set-5-Bayer Design 2nd generation torsion dataset for valence parameter fitting. Cl, Br, F, C, S, O, H, N Error
OpenFF Gen 2 Torsion Set 6 supplemental 2020-03-12-OpenFF-Gen-2-Torsion-Set-6-supplemental Design 2nd generation torsion dataset for valence parameter fitting. S, C, O, H, N Error
OpenFF Gen 2 Torsion Set 1 Roche 2 2020-03-23-OpenFF-Gen-2-Torsion-Set-1-Roche-2 Design 2nd generation torsion dataset for valence parameter fitting. Cl, F, C, S, O, H, N Error
OpenFF Gen 2 Torsion Set 2 Coverage 2 2020-03-23-OpenFF-Gen-2-Torsion-Set-2-Coverage-2 Design 2nd generation torsion dataset for valence parameter fitting. Cl, Br, F, C, S, P, I, O, H, N Error
OpenFF Gen 2 Torsion Set 3 Pfizer Discrepancy 2 2020-03-23-OpenFF-Gen-2-Torsion-Set-3-Pfizer-Discrepancy-2 Design 2nd generation torsion dataset for valence parameter fitting. S, C, F, O, H, N Complete
OpenFF Gen 2 Torsion Set 4 eMolecules Discrepancy 2 2020-03-23-OpenFF-Gen-2-Torsion-Set-4-eMolecules-Discrepancy-2 Design 2nd generation torsion dataset for valence parameter fitting. Cl, Br, F, C, S, P, I, O, H, N Error
OpenFF Gen 2 Torsion Set 5 Bayer 2 2020-03-26-OpenFF-Gen-2-Torsion-Set-5-Bayer-2 Design 2nd generation torsion dataset for valence parameter fitting. Cl, Br, F, C, S, O, H, N Error
OpenFF Gen 2 Torsion Set 6 supplemental 2 2020-03-26-OpenFF-Gen-2-Torsion-Set-6-supplemental-2 Design 2nd generation torsion dataset for valence parameter fitting. Br S, C, F, O, H, N Error
OpenFF Fragmenter Validation 1.0 2020-04-28-Fragmenter-test Examination of different fragmentation schemes. Cl, S, C, P, I, O, H, N Error
OpenFF DANCE 1 eMolecules t142 v1.0 2020-06-01-DANCE-1-eMolecules-t142-selected Molecules selected from the eMolecules database by DANCE to improve t142 parameterization in smirnoff99Frosst. Cl, Br, F, C, S, O, H, N Error
OpenFF Rowley Biaryl v1.0 2020-06-17-OpenFF-Biaryl-set This is a TorsionDrive dataset consisting of biaryl torsions provided by Christopher Rowley. Originally used to benchmark parsley, but could also be useful for fitting. S, C, O, H, N Running
OpenFF-benchmark-ligand-fragments-v1.0 2020-07-27-OpenFF-Benchmark-Ligands This is a torsiondrive dataset created from the OpenFF FEP benchmark dataset. The ligands are fragmented before having key torsions driven. Cl, Br, S, C, F, I, O, H, N Running
OpenFF Theory Benchmarking Set B3LYP-D3BJ DZVP v1.0 2020-07-27-theory-bm-set-b3lyp-d3bj-dzvp This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels. Cl, F, C, S, P, O, H, N Complete
OpenFF Theory Benchmarking Set B3LYP-D3BJ def2-TZVP v1.0 2020-07-30-theory-bm-set-b3lyp-d3bj-def2-tzvp This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels. Cl, F, C, S, P, O, H, N Complete
OpenFF Theory Benchmarking Set B3LYP-D3BJ def2-TZVPD v1.0 2020-07-30-theory-bm-set-b3lyp-d3bj-def2-tzvpd This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels. Cl, F, C, S, P, O, H, N Error
OpenFF Theory Benchmarking Set B3LYP-D3BJ def2-TZVPP v1.0 2020-07-30-theory-bm-set-b3lyp-d3bj-def2-tzvpp This is a TorsionDrive dataset consisting of 36 1-D torsions selected for benchmarking different QM levels. Cl, F, C, S, P, O, H, N Complete
OpenFF Protein Fragments TorsionDrives v1.0 2020-09-16-OpenFF-Protein-Fragments-TorsionDrives This is a protein fragment dataset consisting of torsion drives on various protein fragments prepared by David Cerutti. We have 12 central residues capped with a combination of different terminal residues. We drive the following angles for each fragment: - omega - phi - psi - chi1 (if applicable) - chi2 (if applicable). S, C, O, H, N Error
OpenFF WBO Conjugated Series v1.0 2021-01-25-OpenFF-Conjugated-Series This is a torsion drive dataset that consists of various chemistries that probe a range of conjugated bonds. The goal of this dataset is to develop WBO interpolated torsions for the OpenFF force field. S, C, O, H, N Error
OpenFF Amide Torsion Set v1.0 2021-03-23-OpenFF-Amide-Torsion-Set-v1.0 Amides, thioamides and amidines diversely functionalized. S, C, O, H, N Running
OpenFF Aniline Para Opt v1.0 2021-04-02-OpenFF-Aniline-Para-Opt-v1.0 Optimizations of diverse, para-substituted aniline derivatives. Br, C, O, N, S, H, Cl, F Running
OpenFF Gen3 Torsion Set v1.0 2021-04-09-OpenFF-Gen3-Torsion-Set-v1.0 This dataset is a simple-molecule-only torsiondrive dataset, aiming to avoid the issue of torsion parameter contamination by large internal non-bonded interactions during a valence parameter optimization. Molecules with one effective rotating bond were generated by combining two simple substituents, which were identified by fragmenting small drug-like molecules. Torsions from the generated molecule set were selected using a clustering method, so that the dataset provides a chemically diverse set of molecules for training each torsion parameter. F, N, H, Cl, P, S, O, Br, C Running
OpenFF Aniline 2D Impropers v1.0 2021-03-29-OpenFF-Aniline-2D-Impropers-v1.0 This dataset contains a set of aniline derivatives which have para-substituted groups of varying electron donating and withdrawing properties. This dataset was curated in an effort to improve and understand improper torsions in force fields. We will scan the improper and proper angle simultaneously to better understand the coupling and energetics of these torsions. O, C, S, H, N Running
OpenFF BCC Refit Study COH v2.0 2021-06-22-OpenFF-BCC-Refit-Study-COH-v2.0 A data set curated for the initial stage of the ongoing OpenFF study, which aims to co-optimize the AM1BCC bond charge correction (BCC) parameters against an experimental training set of density and enthalpy of mixing data points and a QM training set of electric field data. The initial data set is limited to molecules composed only of C, O, and H. This limited scope significantly reduces the number of BCC parameters that must be retrained, allowing for easier convergence of the initial optimizations. The included molecules were combinatorially generated to cover a range of alcohol-, ether-, and carbonyl-containing molecules. O, C, S, H, N Running
OpenFF-benchmark-ligand-fragments-v2.0 2021-08-10-OpenFF-JACS-Fragments-v2.0 This is a torsiondrive dataset created from the OpenFF FEP benchmark dataset. The ligands are fragmented using openff-fragmenter with both ambertools and openeye before having key torsions driven. S, N, Br, C, H, O, Cl, F, I Running
OpenFF-Protein-Dipeptide-2D-TorsionDrive-v2.1 2021-11-18-OpenFF-Protein-Dipeptide-2D-TorsionDrive Two-dimensional TorsionDrives on phi and psi for dipeptides of the 20 canonical amino acids and 6 alternate protomers/tautomers. H, C, N, O, S
OpenFF-Protein-Capped-1-mer-Sidechains-v1.3 2022-02-10-OpenFF-Protein-Capped-1-mer-Sidechains Two-dimensional TorsionDrives on chi1 and chi2 for capped 1-mers of amino acids with a rotatable bond in the sidechain. H, C, N, O, S
OpenFF-Protein-Capped-3-mer-Backbones-v1.0 2022-05-30-OpenFF-Protein-Capped-3-mer-Backbones Two-dimensional TorsionDrives on phi and psi for capped 3-mers Ace-Y-X-Y-Nme with Y = {Ala, Val}. H, C, N, O, S
OpenFF-multiplicity-correction-torsion-drive-data-v1.1 2022-04-29-OpenFF-multiplicity-correction-torsion-drive-data-v1.1 A torsiondrive dataset created to correct multiplicity issues in the force field. S, P, O, C, H, N Running
OpenFF-Protein-Capped-3-mer-Omega-v1.0 2023-02-06-OpenFF-Protein-Capped-3-mer-Omega TorsionDrives on omega for capped 3-mers Ace-Ala-X-Ala-Nme. H, C, N, O, S
XtalPi Shared Fragments TorsiondriveDataset v1.0 2024-01-30-xtalpi-shared-fragments-torsiondrive-v1.0 Representative torsion scan molecules used to fit XFF C, H, Cl, Br, S, O, F, N, P
OpenFF Torsion Coverage Supplement v1.0 2024-02-29-OpenFF-Torsion-Coverage-Supplement-v1.0 Additional TorsionDrives to improve coverage for Sage 2.1.0 proper torsions and new parameters from the torsion multiplicity work C, Cl, F, H, N, O, S
OpenFF-RNA-Dinucleoside-Monophosphate-TorsionDrives-v1.0 2024-03-26-OpenFF-RNA-Dinucleoside-Monophosphate-TorsionDrives TorsionDrives of non-ring backbone, glycosidic, and hydroxyl dihedrals in RNA XpY 2-mers. H, C, N, O, P
XtalPi 20-percent Fragments TorsiondriveDataset v1.0 2024-04-02-xtalpi-20-percent-fragments-torsiondrive-v1.0 Torsion scans of larger representative subset (20%) of molecules used to fit XFF O, Br, I, Si, B, C, P, S, Cl, H, N, F
OpenFF Torsion Drive Supplement v1.0 2024-04-17-OpenFF-Torsion-Drive-Supplement-v1.0 Additional TorsionDrives to expand training data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work H, C, N, O, P, S
OpenFF Torsion Multiplicity Torsion Drive Coverage Supplement v1.0 2024-06-14-OpenFF-Torsion-Multiplicity-Torsion-Drive-Coverage-Supplement-v1.0 Additional torsion drive training data for Sage 2.2.0 proper torsions and new parameters from the torsion multiplicity work N, Br, H, P, Cl, O, C, S
OpenFF Phosphate Torsion Drives v1.0 2024-07-17-OpenFF-Phosphate-Torsion-Drives-v1.0 Lipid-like phosphate torsions C, S, N, H, O, P
OpenFF Alkane Torsion Drives v1.0 2024-08-09-OpenFF-Alkane-Torsion-Drives-v1.0 Alka/ene torsion drives C, H

GridOptimization Datasets

These are currently used to perform a scan of one or more internal coordinates (bond, angle, torsion), with an optimization performed at each value in a discrete set.
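
As a rough illustration of what "a discrete set of values" means for a 1-D or 2-D scan, the sketch below enumerates the grid points at which constrained optimizations would be run. It is plain Python and does not use the QCArchive or QCSubmit APIs; the function name `grid_points`, the coordinate labels, and the scan ranges are all hypothetical.

```python
from itertools import product


def grid_points(scans):
    """Return every combination of constrained values for a grid scan.

    `scans` maps a (hypothetical) coordinate label to a tuple of
    (start, stop, n_steps). Each axis is sampled at evenly spaced values,
    and the grid is the Cartesian product of all axes.
    """
    axes = []
    for start, stop, steps in scans.values():
        if steps < 2:
            # A single step pins the coordinate at its start value.
            axes.append([float(start)])
            continue
        width = (stop - start) / (steps - 1)
        axes.append([start + i * width for i in range(steps)])
    return list(product(*axes))


# 1-D scan: one improper angle sampled at 5 values (illustrative range).
one_d = grid_points({"improper": (-40.0, 40.0, 5)})

# 2-D scan: the same improper angle combined with a bond length.
two_d = grid_points({"improper": (-40.0, 40.0, 5), "bond": (1.3, 1.5, 3)})

print(len(one_d), len(two_d))
```

Each tuple in the result corresponds to one constrained optimization, so the cost of a scan grows with the product of the step counts along each scanned coordinate.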

| QCArchive Dataset | Folder | Description | Elements | Status |
|---|---|---|---|---|
| OpenFF Trivalent Nitrogen Set 1 | 2019-06-28-Nitrogen-grid-optimization | Set of diverse trivalent nitrogen molecules for 1-D grid optimization. | Si, Cl, Br, F, C, S, P, B, I, O, H, N | Error |
| OpenFF Trivalent Nitrogen Set 2 | 2019-12-09-Nitrogen-grid-optimization-2d | Set of diverse trivalent nitrogen molecules for 2-D grid optimization. | Si, Cl, Br, F, C, S, P, B, I, O, H, N | Error |
| OpenFF Trivalent Nitrogen Set 3 | 2020-01-15-Nitogen-grid-optimization-02-1dscans | Set of diverse trivalent nitrogen molecules for 1-D grid optimization; this is a secondary dataset. | Cl, Br, S, C, F, O, H, N | Error |