Optimization: Region-based sum-spectra

LachlanStuart commented 4 years ago

The biggest pipeline bottleneck is how quickly it can extract images representing specific m/z-range slices, and calculate the 3 MSM metrics against these images. "Images" are 1000s to 100,000s of pixels. It's clear there is a huge amount of redundancy in the images - PCA decomposition, as an example, usually only needs 10s of components to explain 99% of the variance in a set of ion images.

Performing some dimensionality reduction to reduce the 1000s to 100,000s of spectra down to 10s of components, so that the "images" were also only 10s of pixels, could potentially improve the throughput of the pipeline 100-fold. In addition, intelligently combining spectra could actually improve the quality of the results by being more tolerant of noisy and low-intensity peaks, and expose opportunities for new .

The dimensionality reduction I propose has 3 phases:

1. Recalibrate spectra so that they overlap well.
2. Identify regions by taking images of a variety of well-represented peaks, then clustering them to assign each spectrum to a cluster.
3. Merge the spectra for each cluster into a high-resolution histogram. The centroided peaks never line up well, but in a high-res histogram they clearly resemble normal-distribution curves, which could either be re-centroided or used directly. The resulting sum spectra would fit into memory (~80MB per spectrum in my experiments if no re-centroiding is applied, so ~8GB for a complex dataset with 100 regions).
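
As a rough illustration of the merge phase, the sketch below sums centroided peaks into one histogram per cluster. It assumes the cluster labels from the clustering phase already exist, and it uses a fixed bin width purely for simplicity. `merge_cluster_spectra`, `bin_width`, and the data layout are hypothetical names for illustration, not part of the pipeline.

```python
from collections import defaultdict


def merge_cluster_spectra(spectra, labels, bin_width=0.001):
    """Sum centroided spectra per cluster into fixed-width m/z histograms.

    spectra:   list of spectra, each a list of (mz, intensity) centroid peaks.
    labels:    cluster id for each spectrum (assumed to come from clustering).
    bin_width: hypothetical fixed bin width in m/z units (an assumption).

    Returns ({cluster: {bin_index: summed_intensity}}, {cluster: n_spectra}).
    """
    histograms = defaultdict(lambda: defaultdict(float))
    sizes = defaultdict(int)
    for peaks, label in zip(spectra, labels):
        sizes[label] += 1
        for mz, intensity in peaks:
            # Peaks that fall into the same bin are summed together.
            histograms[label][int(mz // bin_width)] += intensity
    return {k: dict(v) for k, v in histograms.items()}, dict(sizes)
```

The per-cluster spectrum counts are returned alongside the histograms because they are needed later if regions have to be weighted by the number of spectra they contain.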

Regarding how this affects MSM score:

- The spectral accuracy would be unchanged. It sums across all pixels, so as long as regions are weighted by the number of spectra they contain, this would be mathematically identical.
  - There is an opportunity to improve this metric: instead of summing evenly over the +/-3ppm window, the spectral accuracy could correlate the high-res sum spectra against the predicted non-centroided spectrum shape.
- The spatial accuracy would differ, but arguably would improve in quality, because there would be a substantial decrease in noise for low-intensity peaks.
- The chaos metric would not work at all on sum spectra. The pipeline would need a 2nd pass over the dataset to collect most-abundant-isotope images for annotations that pass the spectral/spatial filters. However, most formulae are filtered out before this point, so a 2nd pass restricted to the formulae that passed the previous filters would be significantly faster than a normal image-based scan.
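
The weighting argument for spectral accuracy can be checked with a small numeric sketch. It assumes each region stores the mean intensity per isotope peak. The function names and toy data are made up for illustration.

```python
def region_means(pixel_peaks, labels):
    """Return ({region: mean isotope intensities}, {region: pixel count})."""
    sums, sizes = {}, {}
    for peaks, label in zip(pixel_peaks, labels):
        acc = sums.setdefault(label, [0.0] * len(peaks))
        for i, value in enumerate(peaks):
            acc[i] += value
        sizes[label] = sizes.get(label, 0) + 1
    means = {r: [s / sizes[r] for s in acc] for r, acc in sums.items()}
    return means, sizes


def envelope_from_pixels(pixel_peaks):
    """Isotope envelope obtained by summing every pixel directly."""
    return [sum(column) for column in zip(*pixel_peaks)]


def envelope_from_regions(means, sizes, weighted=True):
    """Isotope envelope rebuilt from region means, optionally size-weighted."""
    n_peaks = len(next(iter(means.values())))
    total = [0.0] * n_peaks
    for region, mean in means.items():
        weight = sizes[region] if weighted else 1
        for i, value in enumerate(mean):
            total[i] += weight * value
    return total


# Toy data: 4 pixels, 2 isotope peaks each; 3 pixels in region 0, 1 in region 1.
pixel_peaks = [(10.0, 5.0), (20.0, 10.0), (30.0, 15.0), (100.0, 10.0)]
labels = [0, 0, 0, 1]
means, sizes = region_means(pixel_peaks, labels)
per_pixel = envelope_from_pixels(pixel_peaks)
weighted = envelope_from_regions(means, sizes, weighted=True)
unweighted = envelope_from_regions(means, sizes, weighted=False)
```

With this toy data the weighted reconstruction matches the per-pixel sum, while the unweighted one gives a different ratio between the two isotope peaks. If regions store sums rather than means, no extra weighting is needed.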

Implementation notes:

Choosing a bin size for the histogram is surprisingly complex because choosing a different sample rate to the sample rate used by the centroiding leads to very uneven, modulated distributions. With the dataset I looked at, m/z values were clipped to 32-bit values, so "every valid 32-bit floating point number" was the optimal choice for histogram bins.
- There are some advantages to using a bin size that increases with sqrt(mz), because this means that Orbitrap peaks always have the same width. However, the above sampling issue makes this very difficult as 32-bit precision number intervals do not increase smoothly.
- Moving the peaks around with recalibration further complicates this.
- Maybe there's a solution somewhere in signal processing for spreading each peak over multiple bins so that even when there are gaps in the distribution, there is enough overlap to smooth across the gaps...
- Maybe I'm overthinking it. It's very plausible that the "every 32-bit floating point value" solution will work for most datasets
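
A minimal sketch of the "every 32-bit floating point value" binning, using only the standard library. It relies on the fact that, for positive IEEE 754 floats, the bit pattern read as an unsigned integer increases by exactly 1 from one representable value to the next. The function names are hypothetical, and m/z values are assumed to be positive and finite.

```python
import struct


def float32_bin_index(mz):
    """Ordinal of the float32 nearest to a positive m/z value."""
    return struct.unpack("<I", struct.pack("<f", mz))[0]


def float32_bin_width(mz):
    """Gap between the float32 nearest to mz and the next float32 above it."""
    index = float32_bin_index(mz)
    lower = struct.unpack("<f", struct.pack("<I", index))[0]
    upper = struct.unpack("<f", struct.pack("<I", index + 1))[0]
    return upper - lower


def float32_histogram(peaks):
    """Sum (mz, intensity) peaks into bins keyed by float32 ordinal."""
    hist = {}
    for mz, intensity in peaks:
        index = float32_bin_index(mz)
        hist[index] = hist.get(index, 0.0) + intensity
    return hist
```

The bin width doubles at every power of two: it is 2^-15 (about 3.05e-5) for m/z values in [256, 512) and 2^-14 from 512 up to 1024. This stepwise growth is what makes a smoothly increasing bin size hard to reconcile with float32 sampling. With NumPy, the same index can be obtained by viewing a `float32` array as `uint32`.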

LachlanStuart commented 4 years ago

A similar approach was already attempted by Andy:

LachlanStuart commented 2 years ago

Closing this as it no longer has a champion

metaspace2020 / metaspace

Optimization: Region-based sum-spectra #583