pluskal-lab / DreaMS

DreaMS (Deep Representations Empowering the Annotation of Mass Spectra)
https://dreams-docs.readthedocs.io
MIT License

DreaMS (Deep Representations Empowering the Annotation of Mass Spectra) is a transformer-based neural network designed to interpret tandem mass spectrometry (MS/MS) data. Pre-trained in a self-supervised way on millions of unannotated spectra from our new GeMS (GNPS Experimental Mass Spectra) dataset, DreaMS acquires rich molecular representations by predicting masked spectral peaks and chromatographic retention orders. When fine-tuned for tasks such as spectral similarity, chemical properties prediction, and fluorine detection, DreaMS achieves state-of-the-art performance across various mass spectrometry interpretation tasks. The DreaMS Atlas, a comprehensive molecular network comprising 201 million MS/MS spectra annotated with DreaMS representations, along with pre-trained models and training datasets, is publicly accessible for further research and development.

🚀 This repository provides the code and tutorials to:

- 🔥 Generate **DreaMS representations** of MS/MS spectra and utilize them for downstream tasks such as spectral similarity prediction or molecular networking.
- 🤖 **Fine-tune DreaMS** for your specific tasks of interest.
- 💎 Access and utilize the extensive **GeMS dataset** of unannotated MS/MS spectra.
- 🌐 Explore the **DreaMS Atlas**, a molecular network of 201 million MS/MS spectra from diverse MS experiments annotated with DreaMS representations and metadata, such as studied species, experiment descriptions, etc.
- ⭐ Efficiently **cluster MS/MS spectra** in linear time using locality-sensitive hashing (LSH).

Additionally, for further research and development:

- 🔄 Convert conventional MS/MS data formats into our new, **ML-friendly HDF5-based format**.
- 📊 Split MS/MS datasets into training and validation folds using **Murcko histograms** of molecular structures.
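To give a rough intuition for the LSH-based clustering mentioned in the list above, here is a minimal sketch of random-hyperplane LSH in NumPy. It is not the DreaMS implementation: the function name `lsh_cluster`, the `n_bits` parameter, and the choice to hash generic vectors are all assumptions made for illustration. The point it shows is that every vector is hashed once and placed in a bucket, so the cost grows linearly with the number of vectors and no pairwise comparisons are needed.

```python
from collections import defaultdict

import numpy as np


def lsh_cluster(vectors, n_bits=16, seed=0):
    """Group vectors by the sign pattern of random hyperplane projections.

    Hypothetical helper for illustration; not part of the DreaMS API.
    """
    vectors = np.asarray(vectors, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # One random hyperplane per hash bit
    hyperplanes = rng.standard_normal((vectors.shape[1], n_bits))
    # For every vector, record on which side of each hyperplane it falls
    bits = (vectors @ hyperplanes) > 0
    buckets = defaultdict(list)
    # Single pass over the data: vectors with identical bit patterns share a bucket
    for index, row in enumerate(bits):
        buckets[row.tobytes()].append(index)
    return list(buckets.values())


# Stand-in data: three random vectors, each repeated three times
rng = np.random.default_rng(42)
base = rng.standard_normal((3, 1024))
vectors = np.repeat(base, 3, axis=0)

clusters = lsh_cluster(vectors)
```

Identical vectors always receive the same hash, and vectors pointing in similar directions are likely to. Using more bits makes buckets smaller and stricter, while fewer bits merges more loosely related vectors.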
📚 Please refer to our [documentation](https://dreams-docs.readthedocs.io/) and paper ["Emergence of molecular structures from repository-scale self-supervised learning on tandem mass spectra"](https://chemrxiv.org/engage/chemrxiv/article-details/6626775021291e5d1d61967f) for more details.

## Getting started

### Installation

Run the following code from the command line.

```shell
# Download this repository
git clone https://github.com/pluskal-lab/DreaMS.git
cd DreaMS

# Create conda environment
conda create -n dreams python==3.11.0 --yes
conda activate dreams

# Install DreaMS
pip install -e .
```

If you are not familiar with conda or do not have it installed, please refer to the [official documentation](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html).

### Compute DreaMS representations

To compute DreaMS representations for MS/MS spectra from an `.mgf` file, run the following Python code.

```python
from dreams.api import dreams_embeddings

# Compute one embedding per spectrum in the example .mgf file
embs = dreams_embeddings('data/examples/example_5_spectra.mgf')
```

The resulting `embs` object is a matrix with 5 rows and 1024 columns: one 1024-dimensional DreaMS representation for each of the 5 input spectra stored in the `.mgf` file.

## References

- Paper: [https://chemrxiv.org/engage/chemrxiv/article-details/6626775021291e5d1d61967f](https://chemrxiv.org/engage/chemrxiv/article-details/6626775021291e5d1d61967f).
- Documentation and tutorials: [https://dreams-docs.readthedocs.io/](https://dreams-docs.readthedocs.io/).
- Weights of pre-trained models: [https://zenodo.org/records/10997887](https://zenodo.org/records/10997887).
- Datasets:
  - GeMS dataset: [https://huggingface.co/datasets/roman-bushuiev/GeMS/tree/main/data](https://huggingface.co/datasets/roman-bushuiev/GeMS/tree/main/data).
  - DreaMS Atlas: [https://huggingface.co/datasets/roman-bushuiev/GeMS/tree/main/data/DreaMS_Atlas](https://huggingface.co/datasets/roman-bushuiev/GeMS/tree/main/data/DreaMS_Atlas).
  - Labeled MS/MS spectra: [https://huggingface.co/datasets/roman-bushuiev/GeMS/tree/main/data/auxiliary](https://huggingface.co/datasets/roman-bushuiev/GeMS/tree/main/data/auxiliary).

If you use DreaMS in your research, please cite the following paper:

```bibtex
@article{bushuiev2024emergence,
  author = {Bushuiev, Roman and Bushuiev, Anton and Samusevich, Raman and Brungs, Corinna and Sivic, Josef and Pluskal, Tomáš},
  title = {Emergence of molecular structures from repository-scale self-supervised learning on tandem mass spectra},
  journal = {ChemRxiv},
  doi = {10.26434/chemrxiv-2023-kss3r-v2},
  year = {2024}
}
```
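As a follow-up to the "Compute DreaMS representations" section, the sketch below shows one way a 5 × 1024 embedding matrix might be turned into pairwise spectral similarities. It assumes cosine similarity as the comparison measure, which this README does not specify, and it uses a random stand-in matrix rather than real DreaMS output. The helper name `cosine_similarity_matrix` is hypothetical and not part of the DreaMS API.

```python
import numpy as np


def cosine_similarity_matrix(embs):
    """Pairwise cosine similarities between the rows of `embs`.

    Hypothetical helper for illustration; not part of the DreaMS API.
    """
    embs = np.asarray(embs, dtype=np.float64)
    # Normalize each row to unit length, guarding against all-zero rows
    norms = np.linalg.norm(embs, axis=1, keepdims=True)
    unit = embs / np.clip(norms, 1e-12, None)
    # Dot products of unit vectors are cosine similarities
    return unit @ unit.T


# Random stand-in with the documented shape (5 spectra x 1024 dimensions);
# in practice this would be the matrix returned by dreams_embeddings(...)
embs = np.random.default_rng(0).standard_normal((5, 1024))
sims = cosine_similarity_matrix(embs)
```

Here `sims[i, j]` is the similarity between spectra `i` and `j`, so the matrix is symmetric with ones on the diagonal.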