Data analysis: MSM/FDR metric improvements

Meta-task for listing problems or potential improvements as we discover them. Tasks here aren't necessarily worth pursuing, but having a record of them will help prevent duplicate investigation.

Ion image extraction

[ ] Normalize different spectra within the same dataset. Currently many datasets have a "gradient" effect where the intensities will gradually fade in/out on one side/corner of the image. There are multiple potential explanations for this, but if we could figure out the cause, we could potentially model it and correct for it. This requires further data analysis.
- [ ] TIC normalization showed some promise but also was a bit of a foot-gun as it could introduce new features into some images. Veronika linked this paper on normalizing images of a standard against itself, but it remains to be seen whether we could normalize against e.g. matrix molecules. Because we rescale images in the browser, it would be very easy to implement configurable/toggleable normalization if the TIC/RMS/etc. images were available.
[ ] Adaptive PPM. There are several ways to do this:
- [ ] test several PPM values on a sample of the dataset to see how strict we can safely take it before data is lost
- [ ] estimate m/z drift & compensate for it (i.e. revisit recalibration, but possibly try for a more dataset-local approach this time)
- [ ] use a multiple of FWHM instead of a fixed PPM
- [ ] use cpyMSpec/ims-cpp's ability to scale the resolving power curve based on instrument type: TOF has constant RP, FTICR has RP = RPbase * 200 / mz, Orbitrap has RP = RPbase * sqrt(200 / mz). We currently use the FTICR formula for everything.
[ ] Simplify metric calculation by using meaningful region-based sum spectra #583
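
For reference, a rough sketch of how the three resolving power curves above translate into a per-m/z window when using a multiple of FWHM instead of a fixed PPM. This is not how cpyMSpec/ims-cpp expose it; the function names, the instrument strings and the `n_fwhm` parameter are all made up for illustration, and it assumes RP = mz / FWHM with RPbase defined at m/z 200.

```python
import math


def resolving_power(mz, rp_base, instrument, base_mz=200.0):
    """Resolving power at `mz`, given `rp_base` defined at `base_mz`.

    Instrument names are hypothetical labels, not a real library API.
    """
    ratio = base_mz / mz
    if instrument == "tof":
        return rp_base                     # constant RP
    if instrument == "fticr":
        return rp_base * ratio             # RP = RPbase * 200 / mz
    if instrument == "orbitrap":
        return rp_base * math.sqrt(ratio)  # RP = RPbase * sqrt(200 / mz)
    raise ValueError(f"unknown instrument: {instrument!r}")


def fwhm(mz, rp_base, instrument):
    """Peak width (FWHM, in m/z units), assuming RP = mz / FWHM."""
    return mz / resolving_power(mz, rp_base, instrument)


def window_ppm(mz, rp_base, instrument, n_fwhm=1.0):
    """Window width in ppm, expressed as a multiple of FWHM."""
    return n_fwhm * fwhm(mz, rp_base, instrument) / mz * 1e6


if __name__ == "__main__":
    for instrument in ("tof", "orbitrap", "fticr"):
        print(instrument, round(window_ppm(800.0, 140000, instrument), 2))
```

At m/z 800 the FTICR curve gives a window twice as wide as the Orbitrap curve and four times as wide as the TOF curve for the same RPbase, which is why applying one formula to everything matters more at high m/z.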

MSM Spatial

[ ] ~Predict missing pixels for less abundant isotopes based on the dataset's minimum allowed intensity value so that low-intensity annotations don't get penalized for having missing pixels in their secondary isotopes' ion images.~ This is probably not the best approach - I feel blurring images or clustering & summing spectra will achieve the same effect but much more reliably
[ ] Blur images before calculating the spatial correlation. This effectively fills in missing pixels, allowing noisy images to achieve higher MSM scores. I've tested this with 3x3 and 5x5 averaging kernels and both improved the number & linearity of annotations.
- Blurring seems to be unacceptably slow. It seems like a better approach would be to downsample.
- I tested several blur kernel radii - 2, 3, 4, 8px. All gave improved results, but it seems dataset-dependent. For ML it'd probably be best to calculate spatial at full, 1/2, 1/4, and 1/8 res and let the ML filter decide which to use
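
A minimal sketch of the downsampling idea, assuming ion images are plain 2D lists of intensities and using a plain Pearson correlation between two images as a stand-in for the real spatial metric (which it is not). `downsample`, `pearson` and `multiscale_correlation` are hypothetical helper names.

```python
import math


def downsample(image, factor):
    """Block-average a 2D image (list of rows) by `factor`.

    Partial blocks at the right/bottom edges are averaged over the
    pixels that exist.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r0 in range(0, rows, factor):
        out_row = []
        for c0 in range(0, cols, factor):
            block = [
                image[r][c]
                for r in range(r0, min(r0 + factor, rows))
                for c in range(c0, min(c0 + factor, cols))
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def pearson(img_a, img_b):
    """Pearson correlation over all pixels; 0.0 if either image is flat."""
    xs = [v for row in img_a for v in row]
    ys = [v for row in img_b for v in row]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def multiscale_correlation(img_a, img_b, factors=(1, 2, 4, 8)):
    """Correlation at full, 1/2, 1/4 and 1/8 resolution, keyed by factor."""
    return {
        f: pearson(downsample(img_a, f), downsample(img_b, f))
        for f in factors
    }


# Toy example: the secondary isotope image has missing pixels.
principal = [[4, 4, 0, 0], [4, 4, 0, 0], [0, 0, 4, 4], [0, 0, 4, 4]]
secondary = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
scores = multiscale_correlation(principal, secondary, factors=(1, 2))
```

In the toy example the half-resolution score is higher than the full-resolution one because the missing pixels get averaged out, which is the same effect blurring has. The per-factor scores could then be passed on as separate features.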

MSM Spectral

[ ] Handle cases where the high resolving power allows peaks closer than 3ppm, e.g. for C40H74NO8P+H at the best supported resolving power (1000K @ 200m/z) there are two peaks, 730.5267 and 730.5292, which overlap significantly. In the Untreated dataset, 83% of peaks detected for these two isotopes fall into the overlapping ppm area.
[ ] Cluster pixels and sum spectra within each cluster to reconstruct a high-resolution/low-noise spectrum. Use these to match against the expected spectrum shape, not just for summing within a ppm-window.
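
To make the overlap in the C40H74NO8P+H example concrete, here is a small sketch that computes the ppm separation between the two peaks and how much their extraction windows overlap. It assumes a symmetric ±ppm window around each theoretical m/z, which is an assumption about how the window is applied; the function names are made up.

```python
def ppm_separation(mz_a, mz_b):
    """Distance between two m/z values in ppm, relative to the lower one."""
    return abs(mz_b - mz_a) / min(mz_a, mz_b) * 1e6


def ppm_window(mz, ppm):
    """(low, high) bounds of a symmetric +/- `ppm` window around `mz`."""
    delta = mz * ppm * 1e-6
    return (mz - delta, mz + delta)


def window_overlap(mz_a, mz_b, ppm):
    """Width (in m/z units) shared by the two windows, 0.0 if disjoint."""
    lo_a, hi_a = ppm_window(mz_a, ppm)
    lo_b, hi_b = ppm_window(mz_b, ppm)
    return max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))


if __name__ == "__main__":
    peak_a, peak_b = 730.5267, 730.5292
    print("separation (ppm):", ppm_separation(peak_a, peak_b))
    print("overlap at +/-3 ppm:", window_overlap(peak_a, peak_b, 3.0))
    print("overlap at +/-1 ppm:", window_overlap(peak_a, peak_b, 1.0))
```

The two peaks are roughly 3.4 ppm apart, so ±3 ppm windows share a good part of their range while ±1 ppm windows would not overlap at all. A check like this could flag isotope peaks whose windows collide before the spectral metric is calculated.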

FDR

[ ] #147 Better FDR through ML
[x] Save continuous FDR scores instead of rounding them up to 5/10/20/50%
[ ] Compress MSM->FDR rankings into histograms so that they're small enough to save and use later, and convenient for doing after-annotation lookups e.g. for #151.
[ ] Model FDR error correctly so that over-confident "0% FDR" annotations don't happen. FDR scores are somewhat analogous to online reviews where we have a finite number of either positive or negative samples and we need to model the distribution to make good decisions based on it. Here's a good overview of relevant math. In particular, Laplace's Rule of Succession suggests that instead of calculating # decoy annotations / # target annotations we should actually be calculating (# decoy annotations + 1) / (# target annotations + 1).
[ ] Calculate FDR on a per-region basis. Often molecules are much higher quality in the regions where they're more abundant. See e.g. Veronika's experiment of 8 wells in one dataset vs 8 datasets of 1 well each: the separated datasets annotated more molecules overall.
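
A minimal sketch of the "+1" correction suggested above, compared against the plain decoy/target ratio. This is a simplification: it assumes a single list of target MSM scores and a single list of decoy MSM scores, ignores decoy sampling, and caps the result at 1.0. `fdr_at_threshold` is a hypothetical helper.

```python
def fdr_at_threshold(target_scores, decoy_scores, threshold, smoothed=True):
    """Estimated FDR among annotations with MSM >= `threshold`.

    smoothed=False: # decoys / # targets (can report exactly 0%).
    smoothed=True:  (# decoys + 1) / (# targets + 1), capped at 1.0.
    """
    n_targets = sum(1 for s in target_scores if s >= threshold)
    n_decoys = sum(1 for s in decoy_scores if s >= threshold)
    if smoothed:
        return min(1.0, (n_decoys + 1) / (n_targets + 1))
    if n_targets == 0:
        return 0.0
    return min(1.0, n_decoys / n_targets)


# Toy scores, purely illustrative.
targets = [0.9, 0.8, 0.7, 0.6]
decoys = [0.65, 0.3]

raw = fdr_at_threshold(targets, decoys, 0.75, smoothed=False)
corrected = fdr_at_threshold(targets, decoys, 0.75, smoothed=True)
```

With only two targets and no decoys above the threshold, the plain ratio reports 0% FDR, while the corrected estimate is 1/3. The correction shrinks towards the plain ratio as the number of targets grows.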

metaspace2020 / metaspace

Data analysis: MSM/FDR metric improvements #317