hgb-bin-proteomics / CandidateSearch

Proof-of-concept implementation of a search engine that uses sparse matrix multiplication to identify the best peptide candidates for a given mass spectrum.
https://hgb-bin-proteomics.github.io/CandidateSearch
MIT License

CandidateSearch

Proof-of-concept implementation of a search engine that uses CandidateVectorSearch to identify the best peptide candidates for a given mass spectrum. CandidateSearch is also the computational backend of the non-cleavable crosslink search in MS Annika. CandidateSearch creates the vector encodings of peptides and spectra that are needed for the sparse matrix search of CandidateVectorSearch.

CandidateSearch can identify peptide candidates from a given mass spectrum without any precursor ion/mass information and without prior knowledge of potential fixed or variable modifications. CandidateSearch can also identify peptidoform candidates if a set of fixed and variable modifications is provided. The aim of CandidateSearch is to reduce the search space for a given identification task by filtering out unlikely peptide or peptidoform candidates. It is NOT meant to be a standalone search engine for peptide/peptidoform identification.
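The idea of encoding peptides and spectra as sparse vectors and ranking candidates by their overlap can be illustrated with a minimal Python sketch. This is not the CandidateSearch or CandidateVectorSearch implementation: the bin resolution, the mass range, and all function names (`encode`, `encode_spectrum`, `top_n_candidates`) are assumptions made purely for illustration.

```python
# Hypothetical sketch of sparse vector encoding and candidate filtering.
# Bin width, mass range and function names are illustrative assumptions,
# not the encoding actually used by CandidateSearch.

MASS_RANGE = 5000       # assumed upper m/z bound
BINS_PER_DALTON = 100   # assumed resolution: one bin per 0.01 m/z


def encode(masses):
    """Encode theoretical ion m/z values as sorted bin indices (a sparse binary vector)."""
    return sorted({int(round(m * BINS_PER_DALTON)) for m in masses if 0 <= m < MASS_RANGE})


def encode_spectrum(peaks, tolerance):
    """Encode spectrum peaks as a set of bins, marking every bin within +/- tolerance."""
    half = int(round(tolerance * BINS_PER_DALTON))
    bins = set()
    for mz in peaks:
        center = int(round(mz * BINS_PER_DALTON))
        bins.update(range(center - half, center + half + 1))
    return bins


def top_n_candidates(candidates, spectrum_bins, n):
    """Rank candidates by the number of ion bins that hit the spectrum vector.

    This count equals the dot product of two binary vectors; doing it for all
    candidates at once corresponds to a sparse matrix * vector product.
    """
    scores = [
        (sum(1 for b in ions if b in spectrum_bins), idx)
        for idx, ions in enumerate(candidates)
    ]
    scores.sort(key=lambda s: (-s[0], s[1]))  # best score first, index breaks ties
    return [idx for _, idx in scores[:n]]


# Made-up ion masses for three hypothetical candidates.
candidates = [
    encode([100.0, 200.0, 300.0]),
    encode([150.0, 250.0]),
    encode([100.0, 250.0, 400.0]),
]
spectrum = encode_spectrum([100.01, 200.0, 300.02], tolerance=0.02)
best = top_n_candidates(candidates, spectrum, n=2)
```

Only the indices of the best-scoring candidates are kept, which mirrors the stated goal of reducing the search space rather than making a final identification.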

A simplified breakdown of the CandidateSearch algorithm is given in the following:

Usage

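Running CandidateSearch requires three files: a spectra file (e.g. spectra.mgf), a sequence database (e.g. database.fasta), and a settings file (e.g. settings.txt).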

The CandidateSearch executable can then be run like this:

CandidateSearch.exe spectra.mgf database.fasta settings.txt

Example files that can be used to test CandidateSearch can be found in /data.

Settings

The settings file accepts the following parameters:

For the last five parameters (listed under VECTOR SEARCH PARAMETERS in the example below), you may additionally want to check the documentation of CandidateVectorSearch to get a better understanding of their meaning.

An empty settings.txt file is a valid configuration for a search (default parameters will be used). Not providing a settings file at all, however, is not valid.

An example settings.txt file is provided here.

Its contents are also listed below to illustrate the formatting:

## DIGESTIONS PARAMETERS
MAX_CLEAVAGES = 2
MIN_PEP_LENGTH = 5
MAX_PEP_LENGTH = 30

## ION CALCULATION PARAMETERS
MAX_PRECURSOR_CHARGE = 4
MAX_FRAGMENT_CHARGE = +1
MAX_NEUTRAL_LOSSES = 1
MAX_NEUTRAL_LOSS_MODS = 2
#FIXED_MODIFICATIONS = None
FIXED_MODIFICATIONS = C:57.021464;
#VARIABLE_MODIFICATIONS = None
VARIABLE_MODIFICATIONS = M:15.994915;
#VARIABLE_MODIFICATIONS = M:15.994915;K:284.173607;

## SEARCH PARAMETERS
DECOY_SEARCH = true

## VECTOR SEARCH PARAMETERS
TOP_N = 1000
TOLERANCE = 0.02
NORMALIZE = false
USE_GAUSSIAN = true
MODE = CPU_SMi32

Documentation

The code of this search engine is fully documented within the .cs code files. A good entry point is the main function of CandidateSearch which is implemented here. Documentation generated by Doxygen is also available here: https://hgb-bin-proteomics.github.io/CandidateSearch/

Requirements

Downloads

Compiled DLLs and executables are available in the exe folder or in Releases.

We supply compiled executables and DLLs for:

For other operating systems/architectures please compile the source code yourself! You will also need to compile CandidateVectorSearch!

Limitations

This is a proof-of-concept implementation that shows the applicability of our CandidateVectorSearch approach, not a fully fledged search engine. It therefore comes with a few limitations:

Results

Example results of CandidateSearch and results analysis are given in tests. An extensive report is given in results.md.

Results on a HeLa dataset

Figure 1: Identifying peptide candidates and peptidoform candidates with CandidateSearch [v1.0.0] in a HeLa dataset using the human Swiss-Prot database. The ground truth was an MS Amanda search validated with Percolator. For every high-confidence PSM we checked whether the identified peptide/peptidoform was among the top 50/100/500/1000 hits of CandidateSearch. We reach almost 100% coverage within the first 1000 hits of CandidateSearch (for reference: the whole database contained ~4 200 000 peptides or ~10 500 000 peptidoforms).
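The coverage check described in the caption can be sketched as follows. This is an illustration only, not the analysis code from the repository: the function name `coverage_at_k`, the input layout (dictionaries keyed by a spectrum identifier), and the toy data are all assumptions.

```python
# Hypothetical top-k coverage calculation; names, data layout and toy data
# are illustrative assumptions, not the repository's analysis code.


def coverage_at_k(ground_truth, ranked_candidates, ks=(50, 100, 500, 1000)):
    """Fraction of ground-truth PSMs whose peptide appears among the top k candidates.

    ground_truth:      dict mapping spectrum id -> validated peptide
    ranked_candidates: dict mapping spectrum id -> list of candidates, best first
    """
    total = len(ground_truth)
    result = {}
    for k in ks:
        hits = sum(
            1
            for spectrum_id, peptide in ground_truth.items()
            if peptide in ranked_candidates.get(spectrum_id, [])[:k]
        )
        result[k] = hits / total if total else 0.0
    return result


# Toy data with made-up sequences.
truth = {1: "PEPTIDEK", 2: "ELVISK", 3: "AAAK"}
ranked = {
    1: ["PEPTIDEK", "XXXK"],
    2: ["XXXK", "YYYK", "ELVISK"],
    3: ["XXXK"],
}
coverage = coverage_at_k(truth, ranked, ks=(1, 3))
```

Here one of three peptides is found at rank 1 and two of three within the top 3. The third is never found, so the coverage stays below 100% for any k.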

Benchmarks

Benchmarks of the different algorithms can be found in benchmarks.md.

Figure 2: Int32-based sparse matrix * dense matrix search using Eigen generally yields the fastest computation time on modern CPUs.

Known Issues

List of known issues

Citing

If you are using [parts of] CandidateSearch please cite:

MS Annika 3.0 (publication wip)

License

Contact

micha.birklbauer@fh-hagenberg.at