metaspace2020 / metaspace

Cloud engine and platform for metabolite annotation for imaging mass spectrometry
https://metaspace2020.eu/
Apache License 2.0

Optimization: Region-based sum-spectra #583

Closed LachlanStuart closed 2 years ago

LachlanStuart commented 4 years ago

The biggest pipeline bottleneck is how quickly it can extract images representing specific m/z-range slices and calculate the 3 MSM metrics against them. These "images" contain 1000s to 100,000s of pixels. There is clearly a huge amount of redundancy in the images: PCA decomposition, for example, usually needs only 10s of components to explain 99% of the variance in a set of ion images.
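To make the redundancy claim concrete, here is a minimal sketch that counts how many principal components are needed to reach 99% explained variance. The data is synthetic (a stand-in built from 8 made-up spatial patterns), and `components_for_variance` is a hypothetical helper, not part of the pipeline.

```python
import numpy as np

def components_for_variance(images, target=0.99):
    """Number of principal components needed to explain `target` of the
    variance in `images`, an array of shape (n_images, n_pixels)."""
    centered = images - images.mean(axis=0, keepdims=True)
    # The squared singular values are proportional to the variance
    # explained by each principal component.
    s = np.linalg.svd(centered, compute_uv=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(explained, target) + 1)

# Synthetic stand-in for a dataset (assumption): 500 ion images of 2,000
# pixels, each a mix of 8 hypothetical spatial patterns plus slight noise.
rng = np.random.default_rng(0)
patterns = rng.random((8, 2000))
weights = rng.random((500, 8))
images = weights @ patterns + 0.001 * rng.standard_normal((500, 2000))

n_components = components_for_variance(images)
print(n_components)
```

On real data the count would depend on the sample and on how the ion images are normalized, so this only illustrates the measurement, not the result.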

Applying dimensionality reduction to bring the 1000s to 100,000s of spectra down to 10s of components, so that the "images" are also only 10s of pixels, could potentially improve the throughput of the pipeline 100-fold. In addition, intelligently combining spectra could actually improve the quality of the results by being more tolerant of noisy and low-intensity peaks, and expose opportunities for new .

The dimensionality reduction I propose has 3 phases:

  1. Recalibrate spectra so that they overlap well.
  2. Identify regions by taking images of a variety of well-represented peaks, then clustering them to assign each spectrum to a cluster.
  3. Merge the spectra for each cluster into a high-resolution histogram. The centroided peaks never line up well, but in a high-res histogram they clearly resemble normal-distribution curves, which could either be re-centroided or used directly. The resulting sum spectra would fit into memory (about 80 MB per spectrum in my experiments if no re-centroiding is applied, so about 8 GB for a complex dataset with 100 regions).
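Phases 2 and 3 could be sketched roughly as follows. This is an illustration under assumptions, not the proposed implementation: recalibration (phase 1) is skipped, the clustering is a tiny hand-rolled k-means standing in for whatever method would actually be used, and all function names, tolerances and bin counts are hypothetical.

```python
import numpy as np

def peak_features(spectra, peak_mzs, tol):
    """Per-spectrum intensity within +/- tol of each selected peak m/z."""
    feats = np.zeros((len(spectra), len(peak_mzs)))
    for i, (mzs, ints) in enumerate(spectra):
        for j, mz in enumerate(peak_mzs):
            feats[i, j] = ints[np.abs(mzs - mz) <= tol].sum()
    return feats

def kmeans_labels(features, n_clusters, n_iter=20):
    """Small k-means with farthest-point initialisation (assumed method)."""
    centers = [features[0]]
    for _ in range(1, n_clusters):
        d = np.min([((features - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(features[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Squared distance from every spectrum to every cluster center.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = features[labels == k]
            if len(members):  # keep the old center if a cluster empties
                centers[k] = members.mean(axis=0)
    return labels

def sum_spectra(spectra, labels, n_clusters, mz_min, mz_max, n_bins):
    """Accumulate centroided peaks into one fine histogram per cluster."""
    edges = np.linspace(mz_min, mz_max, n_bins + 1)
    hists = np.zeros((n_clusters, n_bins))
    for (mzs, ints), label in zip(spectra, labels):
        h, _ = np.histogram(mzs, bins=edges, weights=ints)
        hists[label] += h
    return edges, hists

# Toy dataset (assumption): 40 centroided spectra from two "regions",
# with slightly jittered peak positions so centroids never line up exactly.
rng = np.random.default_rng(1)
spectra = []
for pixel in range(40):
    mzs = np.array([100.0, 200.0]) + rng.normal(0, 0.002, size=2)
    ints = np.array([10.0, 1.0]) if pixel < 20 else np.array([1.0, 10.0])
    spectra.append((mzs, ints))

feats = peak_features(spectra, peak_mzs=[100.0, 200.0], tol=0.05)
labels = kmeans_labels(feats, n_clusters=2)
edges, hists = sum_spectra(spectra, labels, 2,
                           mz_min=99.9, mz_max=200.1, n_bins=100_200)
```

In the toy data the jittered centroids spread over several neighbouring 0.001-wide bins, which is the effect that makes the summed peaks look like smooth curves. As a rough check on the memory figures, and assuming dense 8-byte bins (not stated above), 80 MB per sum spectrum would correspond to about 10 million bins, and 100 regions × 80 MB gives the quoted 8 GB.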

Regarding how this affects MSM score:

Implementation notes:

LachlanStuart commented 4 years ago

A similar approach was already attempted by Andy:

LachlanStuart commented 2 years ago

Closing this as it no longer has a champion