This repository provides implementations and code examples for Metabolite Inference with Spectrum Transformers (MIST). MIST models can be used to predict molecular fingerprints from tandem mass spectrometry data and, when trained in a contrastive learning framework, enable embedding and structure annotation by database lookup. Rather than directly embedding binned spectra, MIST applies a transformer architecture to encode and learn representations of collections of chemical formulae. MIST has since been extended to predict precursor chemical formulae as MIST-CF.
Samuel Goldman, Jeremy Wohlwend, Martin Strazar, Guy Haroush, Ramnik J. Xavier, Connor W. Coley
Update: This branch provides an updated version of the MIST method for increased usability and developability. See the change log for specific details.
After cloning the repository with git, the environment and package can be installed. Please note that the provided environment attempts to use CUDA 11.1. If you do not plan to use GPU support, comment this line out in environment.yml before running the commands below. We strongly recommend replacing conda with mamba for a faster install (e.g., mamba env create -f environment.yml).
conda env create -f environment.yml
conda activate ms-gen
pip install -r requirements.txt
python setup.py develop
This environment was tested on Ubuntu 20.04.1 with CUDA version 11.4. It takes roughly 10 minutes to install using mamba.
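For a CPU-only install, the CUDA entry in environment.yml has to be commented out before the environment is created. A minimal sketch of doing this programmatically follows. The helper name comment_out, the match string cudatoolkit, and the sample YAML text are assumptions for illustration, so check the actual entry in environment.yml first.

```python
def comment_out(yaml_text, needle):
    """Prefix every uncommented line containing `needle` with '# '."""
    out = []
    for line in yaml_text.splitlines():
        stripped = line.lstrip()
        if needle in line and not stripped.startswith("#"):
            indent = line[: len(line) - len(stripped)]
            line = f"{indent}# {stripped}"
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical file content; the real environment.yml may look different.
example = "dependencies:\n  - python=3.8\n  - cudatoolkit=11.1\n"
result = comment_out(example, "cudatoolkit")
print(result)
```

To apply it to the real file, read environment.yml, pass its text through comment_out with the appropriate match string, and write the result back.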
After creating a Python environment, pretrained models can be used to:
1. Predict molecular fingerprints from spectra (quickstart/model_predictions/fp_preds/)
2. Annotate spectra by retrieval from a list of candidate SMILES (quickstart/model_predictions/retrieval/)
3. Embed spectra with the contrastive model (quickstart/model_predictions/contrastive_embed/)
To showcase these capabilities, we include an MGF file, quickstart/quickstart.mgf (a sample from the Mills et al. data), along with a set of sample SMILES, quickstart/lookup_smiles.txt.
conda activate ms-gen
. quickstart/00_download_models.sh
. quickstart/01_run_models.sh
Output predictions can be found in quickstart/model_predictions and are included by default with the repository. We provide an additional notebook, notebooks/mist_demo.ipynb, that shows these calls programmatically rather than from the command line.
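Before running the models on your own data, it can help to inspect the input file. The sketch below is a generic, dependency-free reader for the MGF text format (BEGIN IONS / END IONS blocks with KEY=VALUE headers followed by m/z and intensity pairs). It is not MIST's own loader, the function name read_mgf is hypothetical, and the sample record is made up.

```python
def read_mgf(text):
    """Parse MGF text into a list of {'params': dict, 'peaks': [(mz, intensity)]}."""
    spectra, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Header line such as PEPMASS=181.0707
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z and intensity separated by whitespace
                fields = line.split()
                if len(fields) >= 2:
                    current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


# Made-up record for illustration only.
sample = """BEGIN IONS
FEATURE_ID=demo_1
PEPMASS=181.0707
CHARGE=1+
89.0233 12.0
163.0601 100.0
END IONS
"""
spectra = read_mgf(sample)
print(len(spectra), spectra[0]["params"]["PEPMASS"], len(spectra[0]["peaks"]))
```

The same function can be pointed at quickstart/quickstart.mgf by passing the file's text. The header keys present in that file are not assumed here.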
Training models requires the use of paired mass spectra data and unpaired libraries of molecules as annotation candidates.
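We utilize two datasets to train models: CSI2022 and canopus_train.
Each paired spectra dataset has a set of standardized folders and components living under a single dataset folder (for example, labels.tsv, spec_files/, subformulae/, magma_outputs/, and splits/, as referenced by the training commands later in this document).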
We are not able to redistribute the CSI2022 dataset. The canopus_train dataset (including split changes) can be downloaded and prepared for minimal model execution:
. data_processing/canopus_train/00_download_canopus_data.sh
We intentionally do not include the retrieval HDF file in the data download, as the retrieval file is large (>5GB). It can be re-made by following the instructions below to process PubChem (or one of the other unpaired libraries), then running python data_processing/canopus_train/03_retrieval_hdf.py. The full data processing pipeline used to prepare relevant files can be found in data_processing/canopus_train/ (i.e., subformulae assignment, MAGMa execution, retrieval and contrastive dataframe construction, subsetting of SMILES to be used in augmentation, and assigning subformulae to the augmented MGF provided).
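After downloading, a quick sanity check of the dataset folder can catch missing pieces before training starts. The sketch below assumes the layout implied by the paths passed to the training commands in this README (labels.tsv, spec_files/, subformulae/, magma_outputs/, splits/). That list is an inference, not an authoritative specification, and the helper name missing_entries is hypothetical. The retrieval HDF folder is left out because it is not part of the download.

```python
import tempfile
from pathlib import Path

# Entries inferred from the training commands in this README (an assumption).
EXPECTED = ["labels.tsv", "spec_files", "subformulae", "magma_outputs", "splits"]


def missing_entries(dataset_dir, expected=EXPECTED):
    """Return the expected entries that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in expected if not (root / name).exists()]


# Demo on a throwaway folder that only contains two of the entries.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "labels.tsv").write_text("")
    (Path(tmp) / "splits").mkdir()
    demo_missing = missing_entries(tmp)
print(demo_missing)
```

For the real dataset, call missing_entries("data/paired_spectra/canopus_train") and expect an empty list.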
We consider three example datasets for use as unpaired molecules: biomols, a dataset of biologically relevant molecules prepared by Dührkop et al. for the CANOPUS manuscript; hmdb, the Human Metabolome Database; and pubchem, the most complete dataset of molecules. Instructions for downloading and processing each of these can be found in data_processing/mol_libraries/.
MIST uses these databases of molecules (without spectra) in two ways:
1. Data augmentation: spectra simulated for unpaired molecules are used as additional training examples. The augmented data for canopus_train is stored in data/paired_spectra/canopus_train/aug_iceberg_canopus_train/. See the ms-pred GitHub repository for details on training a model and exporting an MGF. See data_processing/canopus_train/04_subset_smis.sh for how we subsetted the biomolecules dataset to create labels for the ms-pred prediction and data_processing/canopus_train/05_buid_aug_mgf.sh for how we process the resulting MGF into subformulae assignments after export.
2. Retrieval libraries: the molecules serve as annotation candidates, stored in a retrieval database and a contrastive training database. See data_processing/canopus_train/03_retrieval_hdf.py for call signatures to construct both of these, after creating a mapping of chemical formula to SMILES (e.g., data_processing/mol_libraries/pubchem/02_make_formula_subsets.sh).
After downloading the canopus_train dataset, the following two commands demonstrate how to train models that can then be used as illustrated in the quickstart. The config files specify the exact parameters used in experiments as reported in the paper.
MIST Fingerprint model:
CUDA_VISIBLE_DEVICES=0 python src/mist/train_mist.py \
--cache-featurizers \
--labels-file 'data/paired_spectra/canopus_train/labels.tsv' \
--subform-folder 'data/paired_spectra/canopus_train/subformulae/subformulae_default/' \
--spec-folder 'data/paired_spectra/canopus_train/spec_files/' \
--magma-folder 'data/paired_spectra/canopus_train/magma_outputs/magma_tsv/' \
--fp-names morgan4096 \
--num-workers 16 \
--seed 1 \
--gpus 1 \
--augment-data \
--batch-size 128 \
--iterative-preds 'growing' \
--iterative-loss-weight 0.4 \
--learning-rate 0.00077 \
--weight-decay 1e-07 \
--lr-decay-frac 0.9 \
--hidden-size 256 \
--pairwise-featurization \
--peak-attn-layers 2 \
--refine-layers 4 \
--spectra-dropout 0.1 \
--magma-aux-loss \
--magma-loss-lambda 8 \
--magma-modulo 512 \
--split-file 'data/paired_spectra/canopus_train/splits/canopus_hplus_100_0.tsv' \
--forward-labels 'data/paired_spectra/canopus_train/aug_iceberg_canopus_train/biomols_filtered_smiles_canopus_train_labels.tsv' \
--forward-aug-folder 'data/paired_spectra/canopus_train/aug_iceberg_canopus_train/canopus_hplus_100_0/subforms/' \
--frac-orig 0.6 \
--form-embedder 'pos-cos' \
--no-diffs \
--save-dir results/canopus_fp_mist/split_0
Contrastive model:
CUDA_VISIBLE_DEVICES=0 python src/mist/train_contrastive.py \
--seed 1 \
--labels-file 'data/paired_spectra/canopus_train/labels.tsv' \
--subform-folder 'data/paired_spectra/canopus_train/subformulae/subformulae_default/' \
--spec-folder 'data/paired_spectra/canopus_train/spec_files/' \
--magma-folder 'data/paired_spectra/canopus_train/magma_outputs/' \
--hdf-file 'data/paired_spectra/canopus_train/retrieval_hdf/intpubchem_with_morgan4096_retrieval_db_contrast.h5' \
--augment-data \
--contrastive-weight 0.6 \
--contrastive-scale 16 \
--num-decoys 64 \
--max-db-decoys 256 \
--decoy-norm-exp 4 \
--negative-strategy 'hardisomer_tani_pickled' \
--dist-name 'cosine' \
--learning-rate 0.00057 \
--weight-decay 1e-07 \
--scheduler \
--lr-decay-frac 0.7138 \
--patience 10 \
--gpus 1 \
--batch-size 32 \
--num-workers 8 \
--cache-featurizers \
--ckpt-file 'results/canopus_fp_mist/split_0/canopus_hplus_100_0/best.ckpt' \
--split-file 'data/paired_spectra/canopus_train/splits/canopus_hplus_100_0.tsv' \
--forward-labels 'data/paired_spectra/canopus_train/aug_iceberg_canopus_train/biomols_filtered_smiles_canopus_train_labels.tsv' \
--forward-aug-folder 'data/paired_spectra/canopus_train/aug_iceberg_canopus_train/canopus_hplus_100_0/subforms/' \
--frac-orig 0.2 \
--save-dir results/canopus_contrastive_mist/split_0
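In the two commands above, the contrastive model's --ckpt-file points inside the fingerprint model's --save-dir, in a subfolder named after the split file. The helper below encodes that pattern so the checkpoint can be checked before contrastive training is launched. The function name expected_ckpt is hypothetical, and the directory pattern is inferred only from these two example commands.

```python
from pathlib import Path


def expected_ckpt(save_dir, split_file, ckpt_name="best.ckpt"):
    """Checkpoint path following the <save_dir>/<split file stem>/<ckpt_name> pattern."""
    return Path(save_dir) / Path(split_file).stem / ckpt_name


ckpt = expected_ckpt(
    "results/canopus_fp_mist/split_0",
    "data/paired_spectra/canopus_train/splits/canopus_hplus_100_0.tsv",
)
# Report whether the fingerprint model has already been trained.
print(ckpt.as_posix(), "exists" if ckpt.exists() else "missing")
```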
We detail our pipeline for executing updated experiments below. Because the comparisons on the CSI dataset require proprietary data, some will not be runnable. The execution and scripts are included here to help illustrate the logic. Results are precomputed and shown in the analysis notebooks (notebooks/).
We provide summary statistics and chemical classifications of the CANOPUS (NPLIB1) dataset and the combined dataset in notebooks/dataset_analysis.ipynb. The chemical classes are assigned using NPClassifier, which is run via the GNPS endpoint; this is accessed in run_scripts/dataset_analysis/chem_classify.py.
Hyperparameters were previously optimized using Ray Tune and Optuna, as described in the released paper. We use a variation of these parameters by default, but provide additional scripts demonstrating how to tune parameters. See run_scripts/hyperopt/.
We compare four models using the partially proprietary CSI2022 dataset, which includes NIST. These models are a feed-forward network (FFN), a Sinusoidal Transformer, MIST, and CSI:FingerID (as provided by the authors). Configurations for these models can be found and edited in configs/csi_compare. The models themselves can be trained by running the following scripts:
. run_scripts/csi_fp_compare/train_ffn.sh
. run_scripts/csi_fp_compare/train_xformer.sh
. run_scripts/csi_fp_compare/train_mist.sh
After training models, predictions can be made with python run_scripts/csi_fp_compare/eval_models.py. The results are analyzed, generating partial paper figures, in notebooks/fp_csi_compare.ipynb.
To compare against these results in future work, we recommend using the following splits:
data/paired_spectra/canopus_train/splits/canopus_hplus_100_0.tsv
data/paired_spectra/canopus_train/splits/canopus_hplus_100_1.tsv
data/paired_spectra/canopus_train/splits/canopus_hplus_100_2.tsv
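Running the same configuration across the three splits amounts to changing the split-dependent arguments of the training command. A minimal sketch that assembles one train_mist.py command per split is shown below. It uses only the --split-file and --save-dir flags from the fingerprint command in this README. The split_{i} save directory naming extends the split_0 example and is an assumption, as is the helper name fp_train_commands. The remaining flags would be passed through extra_args, and other split-specific arguments such as --forward-aug-folder would need the same substitution.

```python
import shlex

SPLIT_DIR = "data/paired_spectra/canopus_train/splits"


def fp_train_commands(n_splits=3, extra_args=()):
    """Build one train_mist.py command (as an argument list) per split."""
    commands = []
    for i in range(n_splits):
        cmd = [
            "python", "src/mist/train_mist.py",
            "--split-file", f"{SPLIT_DIR}/canopus_hplus_100_{i}.tsv",
            # Save directory naming is assumed from the split_0 example.
            "--save-dir", f"results/canopus_fp_mist/split_{i}",
            *extra_args,
        ]
        commands.append(cmd)
    return commands


# Print the commands instead of running them; pass each list to
# subprocess.run(cmd, check=True) to execute.
for cmd in fp_train_commands(extra_args=["--seed", "1"]):
    print(shlex.join(cmd))
```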
After training fingerprint models, a single contrastive model can be trained on top of the MIST fingerprint model.
. run_scripts/csi_retrieval_compare/train_contrastive.sh
With both trained fingerprint models and contrastive models, retrieval can be executed and evaluated:
python run_scripts/csi_retrieval_compare/mist_fp_retrieval.py
python run_scripts/csi_retrieval_compare/mist_contrastive_retrieval.py
# Conduct equivalent retrieval with SIRIUS/CSI:FingerID exported fingerprints
python run_scripts/csi_retrieval_compare/csi_fp_retrieval.py
python run_scripts/csi_retrieval_compare/analyze_results.py
Retrieval results are subsequently analyzed in notebooks/retrieval_csi_compare.ipynb.
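Conceptually, fingerprint-based retrieval ranks candidate structures by how close their fingerprints are to the predicted one. The dependency-free sketch below uses cosine distance (the value given to --dist-name in the contrastive training command). The toy fingerprints, candidate names, and function names are illustrative, and the repository scripts may differ in their details.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def rank_candidates(pred_fp, candidates):
    """Sort (name, fingerprint) candidates from closest to farthest."""
    scored = [(cosine_distance(pred_fp, fp), name) for name, fp in candidates]
    return [name for _, name in sorted(scored)]


# Toy predicted fingerprint (probabilities) and three candidate bit vectors.
pred = [0.9, 0.1, 0.8, 0.0]
cands = [("A", [1, 0, 1, 0]), ("B", [0, 1, 0, 1]), ("C", [1, 1, 0, 0])]
ranking = rank_candidates(pred, cands)
print(ranking)
```

In practice the candidates would be the molecules from the retrieval library, and the rank of the true structure in the returned list is what a retrieval evaluation measures.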
We also provide code for conducting a latent space retrieval analysis on the contrastive MIST model.
# Download MS2DeepScore and spec2vec
. run_scripts/csi_embed_compare/download_baselines.sh
# Run baselines. Note that the cosine similarity calculation takes quite long
python run_scripts/csi_embed_compare/dist_baselines.py
# Run mist
python run_scripts/csi_embed_compare/dist_mist.py
The results can be inspected in notebooks/embed_csi_compare.ipynb.
We include code to compare the FFN and MIST models on the public dataset. Config files can be found in configs/canopus_compare/. These can be run with the following scripts:
. run_scripts/canopus_compare/train_fp_ffn.sh
. run_scripts/canopus_compare/train_fp_mist.sh
After training these models, they can be evaluated using:
python run_scripts/canopus_compare/eval_fp_models.py
Prediction results are subsequently analyzed in notebooks/fp_canopus_compare.ipynb.
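The metrics computed by the evaluation script are not listed in this document. As one common way to score a predicted fingerprint against the true one, the sketch below binarizes predicted probabilities and computes the Tanimoto similarity. The 0.5 threshold, the function name, and the toy vectors are assumptions for illustration.

```python
def tanimoto(pred_probs, true_bits, threshold=0.5):
    """Tanimoto similarity between thresholded predictions and true bits."""
    pred_bits = [p >= threshold for p in pred_probs]
    both = sum(1 for p, t in zip(pred_bits, true_bits) if p and t)
    either = sum(1 for p, t in zip(pred_bits, true_bits) if p or t)
    # Two all-zero fingerprints are treated as identical.
    return both / either if either else 1.0


# Toy example: one shared bit out of three bits set in either vector.
score = tanimoto([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 1])
print(score)
```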
We use two types of models for public retrieval: MIST FP and MIST contrastive models. The contrastive model requires an HDF5 set of decoys that can be made using data_processing/canopus_train/03_retrieval_hdf.py. This creates both the retrieval database and the contrastive training database. With this in hand, a contrastive model can be trained on top of the fingerprint model:
. run_scripts/canopus_compare/train_contrastive_mist.sh
With both trained fingerprint models and contrastive models, retrieval can be executed and evaluated:
python run_scripts/canopus_compare/retrieval_fp.py
python run_scripts/canopus_compare/retrieval_contrastive_mist.py
python run_scripts/canopus_compare/eval_retrieval.py
Results are subsequently analyzed and inspected in notebooks/retrieval_canopus_compare.ipynb
We detail changes since the published MIST manuscript.
Model changes
src/mist/magma/frag_fp.py
Analysis changes
Organizational changes
We ask users to cite MIST directly by referencing the following paper:
Goldman, S., Wohlwend, J., Stražar, M. et al. Annotating metabolite mass spectra with domain-inspired chemical formula transformers. Nat Mach Intell (2023). https://doi.org/10.1038/s42256-023-00708-3
MIST also builds on a number of other projects, ideas, and software including SIRIUS, MAGMa substructure labeling, the canopus_train data, the Mills et al. IBD data, NPClassifier to classify compounds, PubChem as a retrieval library, and HMDB as a retrieval library. Please consider citing the following full list of papers when relevant: