t7morgen / misato-dataset

GNU Lesser General Public License v2.1
180 stars 16 forks source link
# MISATO - Machine learning dataset of protein-ligand complexes for structure-based drug discovery [![python](https://img.shields.io/badge/-Python_3.7_%7C_3.8_%7C_3.9_%7C_3.10-blue?logo=python&logoColor=white)](https://github.com/pre-commit/pre-commit) [![pytorch](https://img.shields.io/badge/PyTorch_1.10+-ee4c2c?logo=pytorch&logoColor=white)](https://pytorch.org/get-started/locally/) [![lightning](https://img.shields.io/badge/-Lightning_1.8+-792ee5?logo=pytorchlightning&logoColor=white)](https://pytorchlightning.ai/)

:earth_americas: Where we are:

:electron: Vision:

We are a drug discovery community project :hugs:

Lets crack the 100+ ns MD, 30000+ protein-ligand structures and a whole new world of AI models for drug discovery together.

Check out the paper!

Alt text

:purple_heart: Community

Want to get hands-on for drug discovery using AI?

Join our discord server!

Check out our Hugging Face spaces to run and visualize the adaptability model and to perform QM property predictions.

πŸ“ŒΒ Β Introduction

In this repository, we show how to download and apply the Misato database for AI models. You can access the calculated properties of different protein-ligand structures and use them for training in Pytorch based dataloaders. We provide a small sample of the dataset along with the repo.

You can freely download the FULL MISATO dataset from Zenodo using the links below:

wget -O data/MD/h5_files/MD.hdf5 https://zenodo.org/record/7711953/files/MD.hdf5
wget -O data/QM/h5_files/QM.hdf5 https://zenodo.org/record/7711953/files/QM.hdf5

Start with the notebook src/getting_started.ipynb to :

πŸš€Β Β Quickstart

We recommend to pull our MISATO image from DockerHub or to create your own image (see docker/). The images use cuda version 11.8. We recommend to install on your own system a version of CUDA that is a least 11.8 to ensure that the drivers work correctly.

# clone project
git clone https://github.com/t7morgen/misato-dataset.git
cd misato-dataset

For singularity use:

# get the container image
singularity pull docker://sab148/misato-dataset
singularity shell misato.sif

For docker use:

sudo docker pull sab148/misato-dataset:latest
bash docker/run_bash_in_container.sh


Project Structure

β”œβ”€β”€ data                   <- Project data
β”‚   β”œβ”€β”€MD 
β”‚   β”‚   β”œβ”€β”€ h5_files           <- storage of dataset
β”‚   β”‚   └── splits             <- train, val, test splits
β”‚   └──QM
β”‚   β”‚   β”œβ”€β”€ h5_files           <- storage of dataset
β”‚   β”‚   └── splits             <- train, val, test splits
β”‚
β”œβ”€β”€ src                    <- Source code
β”‚   β”œβ”€β”€ data                    
β”‚   β”‚   β”œβ”€β”€ components           <- Datasets and transforms
β”‚   β”‚   β”œβ”€β”€ md_datamodule.py     <- MD Lightning data module
β”‚   β”‚   β”œβ”€β”€ qm_datamodule.py     <- QM Lightning data module
β”‚   β”‚   β”‚
β”‚   β”‚   └── processing           <- Skripts for preprocessing, inference and conversion
β”‚   β”‚      β”œβ”€β”€...    
β”‚   β”œβ”€β”€ getting_started.ipynb     <- notebook : how to load data and interact with it
β”‚   └── inference.ipynb           <- notebook how to run inference
β”‚
β”œβ”€β”€ docker                    <- Dockerfile and execution script 
└── README.md




Installation using your own conda environment

In case you want to use conda for your own installation please create a new misato environment.

In order to install pytorch geometric we recommend to use pip (within conda) for installation and to follow the official installation instructions:pytorch-geometric/install

Depending on your CUDA version the instructions vary. We show an example for the CUDA 11.8.

conda create --name misato python=3
conda activate misato
conda install -c anaconda pandas pip h5py
pip3 install torch --index-url https://download.pytorch.org/whl/cu118 --no-cache
pip install joblib matplotlib
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.0.0+cu118.html
pip install pytorch-lightning==1.8.3
pip install torch-geometric
pip install ipykernel==5.5.5 ipywidgets==7.6.3 nglview==2.7.7
conda install -c conda-forge nb_conda_kernels

To run inference for MD you have to install ambertools. We recommend to install it in a separate conda environment.

conda create --name ambertools python=3
conda activate ambertools
conda install -c conda-forge ambertools nb_conda_kernels
pip install h5py jupyter ipykernel==5.5.5 ipywidgets==7.6.3 nglview==2.7.7

Citation

If you found this work useful please consider citing the article.

@article{siebenmorgen2024misato,
  title={MISATO: machine learning dataset of protein--ligand complexes for structure-based drug discovery},
  author={Siebenmorgen, Till and Menezes, Filipe and Benassou, Sabrina and Merdivan, Erinc and Didi, Kieran and Mour{\~a}o, Andr{\'e} Santos Dias and Kitel, Rados{\l}aw and Li{\`o}, Pietro and Kesselheim, Stefan and Piraud, Marie and Theis, Fabian J. and Sattler, Michael and Popowicz, Grzegorz M.},
  journal={Nature Computational Science},
  pages={1--12},
  year={2024},
  publisher={Nature Publishing Group US New York}
}