Proof-of-concept implementation of a search engine that uses CandidateVectorSearch to identify the best peptide candidates for a given mass spectrum. CandidateSearch is also the computational backend of the non-cleavable crosslink search in MS Annika. CandidateSearch creates the vector encodings of peptides and spectra that are needed for the sparse matrix search of CandidateVectorSearch.
CandidateSearch can identify peptide candidates from a given mass spectrum without any precursor ion/mass information and without prior knowledge of potential fixed or variable modifications. CandidateSearch can also identify peptidoform candidates if a set of fixed and variable modifications is provided. The aim of CandidateSearch is to reduce the search space for a given identification task by filtering out unlikely peptide or peptidoform candidates. It is NOT meant to be a standalone search engine for peptide/peptidoform identification.
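The idea of encoding peptides and spectra as vectors and ranking candidates by their overlap can be illustrated with a small Python sketch. This is not the CandidateSearch or CandidateVectorSearch implementation: the bin width (`BINS_PER_DALTON`), the restriction to singly charged b and y ions, and all function names are assumptions made for illustration. Binary sparse vectors are represented as sets of bin indices, so the dot product of two vectors is the size of their intersection.

```python
from typing import List, Set

# Monoisotopic residue masses in Dalton.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276
BINS_PER_DALTON = 100  # assumption: 0.01 Da bins, chosen for this sketch only


def fragment_mzs(peptide: str) -> List[float]:
    """Singly charged b and y ion m/z values of an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(peptide)):
        ions.append(sum(masses[:i]) + PROTON)          # b ion (prefix)
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y ion (suffix)
    return ions


def encode_peptide(peptide: str) -> Set[int]:
    """Sparse binary vector of a peptide: the set of occupied bins."""
    return {round(mz * BINS_PER_DALTON) for mz in fragment_mzs(peptide)}


def encode_spectrum(peaks: List[float], tolerance: float = 0.02) -> Set[int]:
    """Sparse binary vector of a spectrum: every bin within the tolerance of a peak."""
    bins: Set[int] = set()
    for mz in peaks:
        low = round((mz - tolerance) * BINS_PER_DALTON)
        high = round((mz + tolerance) * BINS_PER_DALTON)
        bins.update(range(low, high + 1))
    return bins


def top_candidates(peaks: List[float], peptides: List[str],
                   top_n: int = 1000, tolerance: float = 0.02) -> List[str]:
    """Rank peptides by the number of bins shared with the spectrum."""
    spectrum = encode_spectrum(peaks, tolerance)
    scored = [(len(encode_peptide(p) & spectrum), p) for p in peptides]
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score first
    return [peptide for _, peptide in scored[:top_n]]
```

Note that no precursor mass enters the ranking in this sketch; only the fragment peaks are compared, which mirrors the precursor-free candidate filtering described above.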
A simplified breakdown of the CandidateSearch algorithm is given in the following:
Running CandidateSearch requires three files: a spectrum file in MGF format, a database in FASTA format, and a settings file.
The CandidateSearch executable can then be run like this:
CandidateSearch.exe spectra.mgf database.fasta settings.txt
Example files that can be used to test CandidateSearch can be found in /data.
The settings file accepts the following parameters:
FIXED_MODIFICATIONS: Fixed modifications, given in the format (char)amino_acid:(double)modification_mass. An example would be carbamidomethylation of cysteine, which would be denoted as C:57.021464;. Several fixed modifications can be provided. (string, default = None)
VARIABLE_MODIFICATIONS: Variable modifications, given in the format (char)amino_acid:(double)modification_mass. An example would be oxidation of methionine, which would be denoted as M:15.994915;. Several variable modifications can be provided. If no modifications are given, CandidateSearch will return the best scoring unmodified peptidoforms for a given spectrum. (string, default = None)
DECOY_SEARCH: Accepts true or false. (bool, default = true)
NORMALIZE: Accepts true or false. (bool, default = false)
USE_GAUSSIAN: Uses a Gaussian with mu = (m/z) and sigma = (tolerance/3). Accepts true or false. (bool, default = true)
For the last five parameters (TOP_N, TOLERANCE, NORMALIZE, USE_GAUSSIAN and MODE) you might additionally want to check the documentation of CandidateVectorSearch to get a better understanding of their meaning.
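The values mu = (m/z) and sigma = (tolerance/3) can be made concrete with a short Python sketch. It only evaluates such a Gaussian around a single peak; how CandidateVectorSearch actually uses these values is not shown here. The function name `gaussian_peak`, the `step` width, and the choice of an unnormalized curve with height 1 at the peak are assumptions.

```python
import math
from typing import List, Tuple


def gaussian_peak(mz: float, tolerance: float, step: float = 0.01) -> List[Tuple[float, float]]:
    """Evaluate a Gaussian with mu = mz and sigma = tolerance / 3
    at evenly spaced positions within +/- tolerance of the peak.

    Returns (position, weight) pairs; the weight is 1.0 at the peak itself.
    """
    sigma = tolerance / 3.0
    n_steps = round(tolerance / step)
    profile = []
    for k in range(-n_steps, n_steps + 1):
        offset = k * step  # distance from mu
        weight = math.exp(-0.5 * (offset / sigma) ** 2)
        profile.append((round(mz + offset, 4), weight))
    return profile
```

With sigma set to a third of the tolerance, the edge of the tolerance window lies three standard deviations from the peak, where the weight has dropped to roughly 1% of its maximum.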
An empty settings.txt file is a valid configuration for a search (default parameters will be used); however, not providing a settings file at all is not valid.
An example settings.txt file is provided here. Additionally, its contents are listed below, which should help in understanding the formatting:
## DIGESTIONS PARAMETERS
MAX_CLEAVAGES = 2
MIN_PEP_LENGTH = 5
MAX_PEP_LENGTH = 30
## ION CALCULATION PARAMETERS
MAX_PRECURSOR_CHARGE = 4
MAX_FRAGMENT_CHARGE = +1
MAX_NEUTRAL_LOSSES = 1
MAX_NEUTRAL_LOSS_MODS = 2
#FIXED_MODIFICATIONS = None
FIXED_MODIFICATIONS = C:57.021464;
#VARIABLE_MODIFICATIONS = None
VARIABLE_MODIFICATIONS = M:15.994915;
#VARIABLE_MODIFICATIONS = M:15.994915;K:284.173607;
## SEARCH PARAMETERS
DECOY_SEARCH = true
## VECTOR SEARCH PARAMETERS
TOP_N = 1000
TOLERANCE = 0.02
NORMALIZE = false
USE_GAUSSIAN = true
MODE = CPU_SMi32
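To make the format above easier to follow, here is a hypothetical Python reader for it. It is not the parser used by CandidateSearch; `parse_settings` and `parse_modifications` are made-up names, and the sketch assumes only what the example shows: one KEY = VALUE pair per line, lines starting with # treated as comments, and modifications written as semicolon-terminated amino_acid:mass entries.

```python
from typing import Dict, List, Tuple


def parse_settings(text: str) -> Dict[str, str]:
    """Read KEY = VALUE lines, skipping blank lines and '#' comments."""
    settings: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a KEY = VALUE line
        settings[key.strip()] = value.strip()
    return settings


def parse_modifications(value: str) -> List[Tuple[str, float]]:
    """Split 'M:15.994915;K:284.173607;' into (amino acid, mass) pairs."""
    if value.strip().lower() == "none":
        return []
    modifications = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # the trailing ';' leaves an empty entry
        amino_acid, mass = entry.split(":")
        modifications.append((amino_acid.strip(), float(mass)))
    return modifications
```

An empty file yields an empty dictionary in this sketch, which corresponds to the case where all default parameters are used.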
The code of this search engine is fully documented within the .cs code files. A good entry point is the main function of CandidateSearch which is implemented here. Documentation generated by Doxygen is also available here:
https://hgb-bin-proteomics.github.io/CandidateSearch/
Compiled DLLs and executables are available in the exe folder or in Releases.
We supply compiled executables and DLLs for:
For other operating systems/architectures please compile the source code yourself! You will also need to compile CandidateVectorSearch!
This is a proof-of-concept implementation that shows the applicability of our CandidateVectorSearch approach and not a fully fledged search engine; therefore, this implementation comes with a few limitations:
Example results of CandidateSearch and results analysis are given in tests. An extensive report is given in results.md.
Figure 1: Identifying peptide candidates and peptidoform candidates with CandidateSearch [v1.0.0] in a HeLa dataset using the human swissprot database. The considered ground truth was an MS Amanda search validated with Percolator. For every high-confidence PSM we checked if the identified peptide/peptidoform was among the top 50/100/500/1000 hits of CandidateSearch. We reach almost 100% coverage within the first 1000 hits of CandidateSearch (for reference: the whole database contained ~4 200 000 peptides or ~10 500 000 peptidoforms).
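The coverage check described in the caption of Figure 1 can be sketched in Python as follows. This is an illustration of the evaluation idea only, not the analysis code from the tests folder; the function name `coverage_at_cutoffs` and the dictionary-based input layout (spectrum identifier mapped to the ground-truth peptide, and to the ranked candidate list) are assumptions.

```python
from typing import Dict, List, Sequence


def coverage_at_cutoffs(ground_truth: Dict[str, str],
                        candidates: Dict[str, List[str]],
                        cutoffs: Sequence[int] = (50, 100, 500, 1000)) -> Dict[int, float]:
    """Fraction of ground-truth identifications found among the top N candidates.

    ground_truth: spectrum id -> peptide (or peptidoform) from the reference search
    candidates:   spectrum id -> candidates ranked best first
    """
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    coverage = {}
    for n in cutoffs:
        hits = sum(
            1
            for spectrum_id, peptide in ground_truth.items()
            if peptide in candidates.get(spectrum_id, [])[:n]
        )
        coverage[n] = hits / len(ground_truth)
    return coverage
```

A spectrum without any candidate list counts as a miss at every cutoff in this sketch.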
Benchmarks of the different algorithms can be found in benchmarks.md.
Figure 2: Int32-based sparse matrix * dense matrix search using Eigen generally yields the fastest computation time on modern CPUs.
If you are using [parts of] CandidateSearch please cite:
MS Annika 3.0 (publication wip)