ProteinLigandBenchmarks

Protein-Ligand Benchmark Dataset for testing parameters and methods of free energy calculations.

Documentation

Documentation for the protein-ligand-benchmark package is hosted on Read the Docs.

Related Publication

The LiveCoMS article "Best practices for constructing, preparing, and evaluating protein-ligand binding affinity benchmarks" provides accompanying information about this benchmark dataset and how to use it for alchemical free energy calculations. To suggest improvements to the article, please raise an issue in its GitHub repository, protein-ligand-benchmark-livecoms.

Installation

The repository uses git-lfs (Git Large File Storage) to store all data files. Ideally, git-lfs is installed before cloning the repository.

conda create -n plbenchmark python=3.7 git-lfs
conda activate plbenchmark
git lfs clone https://github.com/openforcefield/protein-ligand-benchmark.git
cd protein-ligand-benchmark
conda env update --file environment.yml
pip install -e .

Getting Started

Example notebooks can be found in the documentation and in the examples directory. A separate repository accompanies the paper.

Data file tree and file description

The data is organized as follows:

data
├── targets.yml                               # list of all targets and their directories   
├── <date>_<target_name_1>                    # directory for target 1
│   ├── 00_data                               #     metadata for target 1
│   │   ├── edges.yml                         #         edges/perturbations
│   │   ├── ligands.yml                       #         ligands and activities
│   │   └── target.yml                        #         target
│   ├── 01_protein                            #     protein data
│   │   ├── crd                               #         coordinates
│   │   │   ├── cofactors_crystalwater.pdb    #             cofactors and crystal waters (might be empty if there are none)
│   │   │   └── protein.pdb                   #             amino acid residues
│   │   └── top                               #         topology(s)
│   │       └── amber99sb-star-ildn-mut.ff    #             force field spec.
│   │           ├── cofactors_crystalwater.top#                 Gromacs TOP file of cofactors and crystal water (might be empty if there are none)
│   │           ├── protein.top               #                 Gromacs TOP file of amino acid residues
│   │           └── *.itp                     #                 Gromacs ITP file(s) to be included in TOP files
│   ├── 02_ligands                            #     ligands
│   │   ├── lig_<name_1>                      #         ligand 1
│   │   │   ├── crd                           #             coordinates
│   │   │   │   └── lig_<name_1>.sdf          #                 SDF file
│   │   │   └── top                           #             topology(s)
│   │   │       └── openff-1.0.0.offxml       #                 force field spec.
│   │   │           ├── fflig_<name_1>.itp    #                     Gromacs ITP file: atom types
│   │   │           ├── lig_<name_1>.itp      #                     Gromacs ITP file
│   │   │           ├── lig_<name_1>.top      #                     Gromacs TOP file
│   │   │           └── posre_lig_<name_1>.itp#                     Gromacs ITP file: position restraint file
│   │   ├── lig_<name_2>                      #         ligand 2
│   │   …
│   └── 03_hybrid                             #     edges (perturbations)
│       ├── edge_<name_1>_<name_2>            #         edge between ligand 1 and ligand 2
│       │   └── water                         #             edge in water
│       │       ├── crd                       #                 coordinates
│       │       │   ├── mergedA.pdb           #                     merged conf based on coords of ligand 1
│       │       │   ├── mergedB.pdb           #                     merged conf based on coords of ligand 2
│       │       │   ├── pairs.dat             #                     atom mapping
│       │       │   └── score.dat             #                     similarity score
│       │       └── top                       #                 topology(s)
│       │           └── openff-1.0.0.offxml   #                     force field spec.
│       │               ├── ffmerged.itp      #                         Gromacs ITP file
│       │               ├── ffMOL.itp         #                         Gromacs ITP file
│       │               └── merged.itp        #                         Gromacs ITP file
│       …
├── <date>_<target_name_2>                    # directory for target 2  
…
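
The layout above maps directly onto file paths. The following is a minimal sketch of how such paths could be assembled with pathlib. The helper names (metadata_file, ligand_sdf, edge_atom_mapping) are hypothetical and are not part of the plbenchmark package; the sketch only assumes the directory and file names shown in the tree.

```python
from pathlib import Path

# Hypothetical helpers (not part of the plbenchmark package); they only
# spell out the directory layout shown in the tree above.

def metadata_file(data_root, target_dir, kind):
    """Path of a metadata file; kind is 'edges', 'ligands' or 'target'."""
    return Path(data_root) / target_dir / "00_data" / f"{kind}.yml"

def ligand_sdf(data_root, target_dir, ligand):
    """Path of a ligand's SDF file; ligand is a directory name such as 'lig_23'."""
    return Path(data_root) / target_dir / "02_ligands" / ligand / "crd" / f"{ligand}.sdf"

def edge_atom_mapping(data_root, target_dir, edge):
    """Path of the atom mapping of an edge in water; edge is e.g. 'edge_50_60'."""
    return Path(data_root) / target_dir / "03_hybrid" / edge / "water" / "crd" / "pairs.dat"

sdf_path = ligand_sdf("data", "2020-08-26_mcl1_sample", "lig_23")
print(sdf_path.as_posix())
```

The functions only build paths and do not check that the files exist, so they can be used before the git-lfs data has been fetched.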

Description of meta data YAML files

targets.yml

This file lists all the registered targets in the benchmark set. Each entry denotes one target and contains the following information:

mcl1_sample:
  name:     mcl1_sample
  date:     2020-08-26
  dir:      2020-08-26_mcl1_sample

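mcl1_sample is the entry name, and each entry has three sub-entries: name (the name of the target), date (the date that also prefixes the directory name), and dir (the directory of the target inside data).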
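
As an illustration of how these entries relate to the directory names, here is a small sketch that works on the already parsed content of targets.yml. It assumes the file was read with a YAML loader that returns nested dictionaries (unquoted ISO dates are then usually returned as datetime.date objects). The function name target_directories is hypothetical, and the strict check of the <date>_<name> convention is an assumption based on the file tree above.

```python
import datetime

# Parsed form of the example entry above, as a YAML loader would
# typically return it (assumption: the date is loaded as datetime.date).
targets = {
    "mcl1_sample": {
        "name": "mcl1_sample",
        "date": datetime.date(2020, 8, 26),
        "dir": "2020-08-26_mcl1_sample",
    }
}

def target_directories(targets):
    """Map each entry name to its directory, checking the <date>_<name> pattern."""
    directories = {}
    for key, entry in targets.items():
        # Formatting a datetime.date (or an ISO date string) gives YYYY-MM-DD.
        expected = f"{entry['date']}_{entry['name']}"
        if entry["dir"] != expected:
            raise ValueError(f"{key}: dir {entry['dir']!r} does not match {expected!r}")
        directories[key] = entry["dir"]
    return directories

print(target_directories(targets))
```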

target.yml

This file is found in the metadata directory of each target: <date>_<target_name>/00_data/target.yml. It contains additional information about the target:

alternate:
  iridium_classifier: HT
  iridium_score: 0.3
  pdb: 6O6F
associated_sets:
- Schrodinger JACS
comments: hydrophobic interactions contributing to binding
date: 2019-12-13
dpi: 0.26
id: 9
iridium_classifier: HT
iridium_score: 0.41
name: mcl1
netcharge: 4 e
pdb: 4HW3
references:
  calculation:
  - 10.1021/ja512751q
  - 10.1021/acs.jcim.9b00105
  - 10.1039/C9SC03754C
  measurement:
  - 10.1021/jm301448p
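
A short sketch of how two of these entries could be consumed once the file has been parsed into a dictionary. The helper names net_charge and reference_dois are hypothetical, and the reading of netcharge as "<integer> e" is an assumption based on the single example above.

```python
# Subset of the example above, in the form a YAML loader would return.
target = {
    "name": "mcl1",
    "pdb": "4HW3",
    "netcharge": "4 e",
    "references": {
        "calculation": [
            "10.1021/ja512751q",
            "10.1021/acs.jcim.9b00105",
            "10.1039/C9SC03754C",
        ],
        "measurement": ["10.1021/jm301448p"],
    },
}

def net_charge(target):
    """Parse a net charge such as '4 e' into a number of elementary charges."""
    value, unit = str(target["netcharge"]).split()
    if unit != "e":
        raise ValueError(f"unexpected charge unit: {unit!r}")
    return int(value)

def reference_dois(target):
    """Collect the DOIs of all reference categories, without duplicates."""
    references = target.get("references", {})
    return sorted({doi for dois in references.values() for doi in dois})

print(net_charge(target), reference_dois(target))
```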

Explanation of the entries:

ligands.yml

This file is found in the metadata directory of each target: <date>_<target_name>/00_data/ligands.yml. It contains information about the ligands of one target. One entry looks like this:

lig_23:
  measurement:
    comment: Table 2, entry 23
    doi: 10.1021/jm301448p
    error: 0.03
    type: ki
    unit: uM
    value: 0.37
  name: lig_23
  smiles: '[H]c1c(c(c2c(c1[H])c(c(c(c2OC([H])([H])C([H])([H])C([H])([H])C3=C(Sc4c3c(c(c(c4[H])[H])[H])[H])C(=O)[O-])[H])[H])[H])[H])[H]'
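
Because the dataset targets free energy calculations, a measured inhibition constant is usually converted to a binding free energy via ΔG = RT ln(Ki / 1 M). The sketch below shows that conversion for the measurement block above. The function name binding_free_energy, the temperature of 298.15 K, the unit table, and the first-order error propagation are assumptions made for illustration; they are not taken from the plbenchmark package.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}

def binding_free_energy(measurement, temperature=298.15):
    """Return (dG, error) in kcal/mol from a Ki measurement (1 M standard state)."""
    if measurement["type"] != "ki":
        # Only the type shown in the example is handled in this sketch.
        raise ValueError(f"unsupported measurement type: {measurement['type']!r}")
    ki_molar = measurement["value"] * UNIT_TO_MOLAR[measurement["unit"]]
    rt = R_KCAL * temperature
    dg = rt * math.log(ki_molar)
    # First-order propagation: d(ln K) = dK / K, so the unit factor cancels.
    dg_error = rt * measurement["error"] / measurement["value"]
    return dg, dg_error

# Measurement block of lig_23 from the example above.
measurement = {"error": 0.03, "type": "ki", "unit": "uM", "value": 0.37}
dg, dg_error = binding_free_energy(measurement)
print(f"dG = {dg:.2f} +/- {dg_error:.2f} kcal/mol")
```

For the example ligand this gives a binding free energy of roughly -8.8 kcal/mol at the assumed temperature.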

Explanation of the entries:

edges.yml

This file is found in the metadata directory of each target: <date>_<target_name>/00_data/edges.yml. It contains information about the edges of one target. One entry looks like this:

edge_50_60:
  ligand_a: lig_50
  ligand_b: lig_60

Each entry simply names the two ligands, ligand_a and ligand_b, that the edge connects.
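
Taken together, the edges of one target form a perturbation graph over its ligands. The following sketch builds that graph from the parsed content of edges.yml and checks that every edge refers to a known ligand. The function names neighbours and unknown_ligands are hypothetical and not part of the plbenchmark package.

```python
# Parsed form of the example entry above.
edges = {"edge_50_60": {"ligand_a": "lig_50", "ligand_b": "lig_60"}}

def neighbours(edges):
    """Build an undirected adjacency map: ligand -> set of connected ligands."""
    graph = {}
    for edge in edges.values():
        a, b = edge["ligand_a"], edge["ligand_b"]
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def unknown_ligands(edges, ligand_names):
    """Return ligands used in edges that are missing from ligand_names."""
    used = {edge[key] for edge in edges.values() for key in ("ligand_a", "ligand_b")}
    return sorted(used - set(ligand_names))

print(neighbours(edges))
print(unknown_ligands(edges, ["lig_50", "lig_60"]))
```

Here ligand_names would typically be the keys of the parsed ligands.yml of the same target.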

Summary

Summary of the contents of the Protein-Ligand Benchmark Dataset. It contains the available protein targets with corresponding PDB ID and number of ligands.

Target      PDB    Number of ligands
bace        4DJW   36
bace_hunt   4JPC   32
bace_p2     3IN4   12
cdk2        1H1Q   16
cdk8        5HNB   33
cmet        4R1Y   12
eg5         3L9H   28
galectin    5E89   8
hif2a       5TBM   42
jnk1        2GMX   21
mcl1        4HW3   42
p38         3FLY   34
pde10       4BBX   35
pde2        6EZF   21
pfkfb3      6HVI   40
ptp1b       2QBS   23
shp2        5EHR   26
syk         4PV0   44
thrombin    2ZFF   11
tnks2       4UI5   27
tyk2        4GIH   16

Release History

Releases follow the major.minor.micro scheme recommended by PEP 440.

Contributions

License

MIT. See the License File for more information.

CC-BY-4.0 for data (content of directory data). See the License File for more information.

Copyright

Copyright (c) 2021, Open Force Field Consortium, David F. Hahn

Acknowledgements

Project based on the Computational Molecular Science Python Cookiecutter version 1.1.