:warning: This is not the code repo for CSI:fingerID [Dührkop et al]. To use CSI:fingerID, please visit this page. CSI:fingerID extends the method of this repo in many ways, including more kernels for MS/MS spectra, a different set of molecular fingerprints, and a different scoring function. This repo is not actively maintained; however, if you spot a bug or have questions, please open an issue.
This is the FingerID 1.4 release. This version focuses only on fingerprint prediction and does not include fingerprint generation or compound database retrieval. This repo includes the code for the paper:
Shen, H., Dührkop, K., Böcker, S. and Rousu, J., 2014. Metabolite identification through multiple kernel learning on fragmentation trees. Bioinformatics, 30(12), pp.i157-i164.
This package uses fragmentation trees, another view of MS/MS spectra, to improve fingerprint prediction.
The previous versions are hosted on sourceforge: http://sourceforge.net/projects/fingerid/
Please note that the package has NOT been tested on Windows. Running the package on Linux is recommended.
If you have root permission (assuming setuptools is installed):
python setup.py install
or if you do not have root permission:
python setup.py install --user
or in your python script:
import sys
sys.path.append("path_to_this_folder")
Using the package takes three sequential steps: parse, kernel, and predict. Two complete examples are provided in shen_ISMB2014.py and train_test.py.
Parse MS/MS spectra to the internal representation.
For MS/MS data in the same format as the example dataset provided in the package, one can use the following:
from fingerid.preprocess.msparser import MSParser
# ms_folder is the folder for all the spectra.
msparser = MSParser()
ms_list = msparser.parse_dir(ms_folder)
For the MS/MS data downloaded from MassBank:
from fingerid.preprocess.massbankparser import MassBankParser
mbparser = MassBankParser()
# ms_folder is the folder for all the spectra.
ms_list = mbparser.parse_dir(ms_folder)
For the MS/MS data downloaded from Metlin (.msx format):
from fingerid.preprocess.metlinparser import MetlinParser
mlparser = MetlinParser()
# ms_folder is the folder for all the spectra.
ms_list = mlparser.parse_dir(ms_folder)
For the fragmentation tree in .dot format (fgtree_folder is the folder name for fragmentation tree data):
from fingerid.preprocess.fgtreeparser import FragTreeParser
fgtreeparser = FragTreeParser()
trees = fgtreeparser.parse_dir(fgtree_folder)
Two types of kernel functions are provided. For the MS/MS data, "PPK" kernel is used:
from fingerid.preprocess.msparser import MSParser
from fingerid.kernel.twodgaussiankernel import TwoDGaussianKernel
msparser = MSParser()
train_ms_list = msparser.parse_dir(train_ms_folder)
# Compute the PPK kernel with m/z variance sm and intensity variance si.
# In practice, it is important to tune sm and si by cross validation.
kernel = TwoDGaussianKernel(sm, si)
train_km = kernel.compute_train_kernel(train_ms_list)
# When test data is available, compute the test kernel against the training spectra.
test_ms_list = msparser.parse_dir(test_ms_folder)
test_km = kernel.compute_test_kernel(test_ms_list, train_ms_list)
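Tuning sm and si by cross validation can be sketched as a plain grid search. The block below is a minimal illustration, not part of the package: `grid_search` and `toy_score` are hypothetical names, the candidate values are arbitrary, and `toy_score` merely stands in for a function that would build `TwoDGaussianKernel(sm, si)`, compute the training kernel, and return a cross-validated score.

```python
from itertools import product


def grid_search(score_fn, sm_values, si_values):
    """Return (best_score, best_sm, best_si) over all (sm, si) pairs.

    score_fn is a hypothetical callable that maps (sm, si) to a score
    where higher is better, e.g. a cross-validated accuracy.
    """
    best = None
    for sm, si in product(sm_values, si_values):
        score = score_fn(sm, si)
        if best is None or score > best[0]:
            best = (score, sm, si)
    return best


def toy_score(sm, si):
    # Stand-in for a real cross-validated score; it peaks at sm=0.1, si=1.0.
    return -((sm - 0.1) ** 2 + (si - 1.0) ** 2)


best_score, best_sm, best_si = grid_search(
    toy_score, sm_values=[0.001, 0.01, 0.1], si_values=[0.1, 1.0, 10.0]
)
```

With a real score function, each grid point requires recomputing the kernel matrix, so a coarse grid is a reasonable starting point.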
For fragmentation trees:
Parse the fragmentation trees:
from fingerid.preprocess.fgtreeparser import FragTreeParser
fgtreeparser = FragTreeParser()
train_trees = fgtreeparser.parse_dir(train_fgtree_folder)
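Compute the training kernel (FragTreeKernel must be imported from the fingerid package first; the import line is not shown here):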
kernel = FragTreeKernel()
# Kernel can be "NB","NI","LB","LC","LI","RLB","RLI","CPC","CP2","CPK","CSC"
train_tree_km = kernel.compute_train_kernel(train_trees, "NB")
When you have test data for fragmentation trees:
test_trees = fgtreeparser.parse_dir(test_fgtree_folder)
n_train = len(train_trees)
n_test = len(test_trees)
kernel = FragTreeKernel()
# Kernel can be "NB","NI","LB","LC","LI","RLB","RLI","CPC","CP2","CPK","CSC"
train_tree_km = kernel.compute_train_kernel(train_trees, "NB")
test_tree_km = kernel.compute_test_kernel(test_trees, train_trees, "NB")
To combine the kernels using multiple kernel learning (MKL):
# km_list is a list of kernel matrices (each a numpy 2d array).
# output is the fingerprint matrix (numpy 2d array).
# The MKL algorithm can be 'UNIMKL', 'ALIGN' or 'ALIGNF'.
# ckm is the combined kernel and kw holds the weights of the kernels.
# The same weights can be used to combine the test kernels.
from fingerid.kernel.mkl import mkl
ckm, kw = mkl(km_list, output, 'ALIGN')
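As a rough illustration of what kernel combination involves, the sketch below forms a weighted sum of kernel matrices, once with uniform weights and once with weights taken from the centered alignment between each kernel and the target kernel built from the outputs. It is a textbook-style sketch under assumptions, not the package's `mkl` implementation: the helper names `center`, `alignment` and `combine` are hypothetical, the toy matrices are made up, and the package may normalize or compute its weights differently.

```python
import numpy as np


def center(km):
    """Center a square kernel matrix: H K H with H = I - 11^T / n."""
    n = km.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ km @ h


def alignment(k1, k2):
    """Centered alignment: Frobenius inner product over the product of norms."""
    k1c, k2c = center(k1), center(k2)
    return np.sum(k1c * k2c) / (np.linalg.norm(k1c) * np.linalg.norm(k2c))


def combine(km_list, weights):
    """Weighted sum of kernel matrices that share the same shape."""
    return sum(w * km for w, km in zip(weights, km_list))


# Toy data: 4 training examples, 2 fingerprints (assumed layout: rows = examples).
output = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
target = output @ output.T               # target kernel built from the outputs
km_list = [target.copy(), np.eye(4)]     # two toy training kernels

uniform_w = np.ones(len(km_list)) / len(km_list)
align_w = np.array([alignment(km, target) for km in km_list])

ckm_uniform = combine(km_list, uniform_w)
ckm_align = combine(km_list, align_w)

# Reuse the training weights on the test kernels (assumed shape: n_test x n_train).
test_km_list = [
    np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]),
    np.zeros((2, 4)),
]
test_ckm = combine(test_km_list, align_w)
```

The key point is that the weights are learned from training data only and then applied unchanged to the test kernels, so that training and test kernels live in the same combined feature space.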
To perform cross validation on training data:
from fingerid.model.internalCV_mp import internalCV
# kernel is the kernel matrix (numpy 2d array)
# labels is fingerprint matrix (numpy 2d array).
# n_folds is the number of folds used in the cross validation
# select_c is a boolean specifying whether to select the SVM parameter C.
n_folds = 5
cvpreds = internalCV(kernel, labels, n_folds, select_c=False)
To perform cross validation on training data with multiple processes, which is useful when you have many fingerprints (outputs) to train:
from fingerid.model.internalCV_mp import internalCV_mp
# n_p is the number of processes to be used
cvpreds = internalCV_mp(kernel, labels, n_folds, select_c=False, n_p=8)
To train the model on all the data instead of doing cross validation:
from fingerid.model.trainSVM import trainModels
model_dir = "MODELS" # model_dir is the folder to store the trained models
# n_p is the number of processes to be used (keyword name assumed to match internalCV_mp)
models = trainModels(kernel, labels, model_dir, select_c=False, n_p=n_p)
To predict on the test data using trained models:
from fingerid.model.trainSVM import trainModels
from fingerid.model.predSVM import predModels
model_dir = "MODELS" # model_dir is the folder to store the trained models
# n_p is the number of processes to be used (keyword name assumed to match internalCV_mp)
trainModels(train_kernel, labels, model_dir, select_c=False, n_p=n_p)
preds = predModels(test_kernel, n_fp, model_dir) # n_fp is the number of fingerprints
It may be necessary to check that the spectra and the fragmentation trees are in the same order. To write out the order in which the spectra files and fragmentation tree files were parsed:
from fingerid.preprocess.util import writeIDs
writeIDs("spectras.txt", train_ms_list)
writeIDs("fgtrees.txt", train_trees)
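Once both files are written, the two orderings can be compared programmatically. The sketch below is one way to do that under stated assumptions: it assumes each file holds one ID per line and that a spectrum and its tree share the same base file name apart from the extension. Neither assumption is confirmed by the package, and `read_ids`, `stem` and `first_mismatch` are hypothetical helpers. The demo writes its own toy files to a temporary folder.

```python
import os
import tempfile


def read_ids(path):
    """Read non-empty lines from a file (assumed: one ID per line)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def stem(name):
    """Drop the directory and extension, e.g. 'dir/cpd001.ms' -> 'cpd001'."""
    return os.path.splitext(os.path.basename(name))[0]


def first_mismatch(ids_a, ids_b):
    """Return the first index where the two ID lists disagree, else None."""
    for i, (a, b) in enumerate(zip(ids_a, ids_b)):
        if stem(a) != stem(b):
            return i
    if len(ids_a) != len(ids_b):
        return min(len(ids_a), len(ids_b))
    return None


# Demo with toy ID files; the second and third trees are swapped on purpose.
tmp_dir = tempfile.mkdtemp()
spectra_path = os.path.join(tmp_dir, "spectras.txt")
trees_path = os.path.join(tmp_dir, "fgtrees.txt")
with open(spectra_path, "w") as handle:
    handle.write("cpd001.ms\ncpd002.ms\ncpd003.ms\n")
with open(trees_path, "w") as handle:
    handle.write("cpd001.dot\ncpd003.dot\ncpd002.dot\n")

mismatch = first_mismatch(read_ids(spectra_path), read_ids(trees_path))
```

A result of `None` means the two lists line up; any integer points at the first row where the kernel matrices would be misaligned.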
[1] Huibin Shen, Kai Dührkop, Sebastian Böcker and Juho Rousu: Metabolite Identification through Multiple Kernel Learning on Fragmentation Trees. In the proceedings of ISMB 2014, Bioinformatics 30(12), i157-i164 (2014).
[2] Huibin Shen, Nicola Zamboni, Markus Heinonen, Juho Rousu: Metabolite identification through machine learning -- tackling CASMI challenge using FingerID. Metabolites 3(2), 484-505 (2013).
[3] Markus Heinonen, Huibin Shen, Nicola Zamboni, Juho Rousu: Metabolite identification and molecular fingerprint prediction through machine learning. In the proceedings of MLSB 2012, Bioinformatics 28(18), 2333-2341 (2012).