@intsco I think you clicked the wrong button and closed this issue. If you're getting notification spam, unsubscribe with the button on the right sidebar.
The majority of this was implemented in #1019.
I've split the remaining coding work into a separate task (#1062) so that this huge issue can be closed.
~Depends on #796~ (recalibration has advanced enough that this is no longer blocked). Closes #147.
Plan:
Algorithm changes
Points marked (non-core) are not a vital part of the project, but may reveal easy wins. They can be skipped if implementation would take too much time.
Pre-annotation
Decoys / isotope generation
Image/metric sets
For every batch of images for which metrics will be calculated, the following combinations of images should have metrics calculated:
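Whatever the final set of combinations, the batching pattern is the same. A minimal sketch, assuming a generic list of named-image combinations and a dict of metric functions (all names below are hypothetical):

```python
# Minimal sketch of the batching pattern: compute every metric for every
# image combination in a batch. `images`, `combinations` and `metric_fns`
# are hypothetical names; the real pipeline defines which combinations are used.
from typing import Callable, Dict, List, Sequence, Tuple

import numpy as np

def compute_metrics_for_batch(
    images: Dict[str, np.ndarray],
    combinations: Sequence[Tuple[str, ...]],
    metric_fns: Dict[str, Callable[..., float]],
) -> List[dict]:
    results = []
    for combo in combinations:
        imgs = [images[name] for name in combo]
        row = {"combination": combo}
        row.update({metric: fn(*imgs) for metric, fn in metric_fns.items()})
        results.append(row)
    return results
```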
Metrics
Spectral
Spatial
Chaos
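For context on the two items below, here is a rough sketch of a spatial-chaos metric built from the `nlevels` and `label` pieces they mention: threshold the ion image at `nlevels` intensity levels and run connected-component labelling at each level. It illustrates why `label` dominates the cost and why reducing `nlevels` is a cheap speed-up; it is not the actual implementation.

```python
# Rough sketch of an nlevels-based spatial-chaos metric. `scipy.ndimage.label`
# runs once per intensity level, so reducing `nlevels` directly cuts runtime.
# This is an illustration, not the actual METASPACE implementation.
import numpy as np
from scipy import ndimage

def spatial_chaos_sketch(img: np.ndarray, nlevels: int = 30) -> float:
    nonzero = np.count_nonzero(img)
    if nonzero == 0:
        return 0.0
    levels = np.linspace(0, img.max(), nlevels, endpoint=False)
    object_counts = []
    for level in levels:
        # The expensive step, repeated nlevels times
        _, num_objects = ndimage.label(img > level)
        object_counts.append(num_objects)
    # Fewer objects per non-zero pixel => a cleaner, less "chaotic" image
    return float(1 - np.mean(object_counts) / nonzero)
```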
[ ] `nlevels` can be reduced
[ ] `label` functions are too slow. I've also added a number of potential improvements, each as separate metrics so that they can be trialed on a larger scale:

Other stats
Derived metrics
(i.e. metrics that can be calculated outside of the annotation pipeline based on the above)
Other pipeline changes
Evaluation
Test datasets
Evaluation criteria
Through discussion it has become clear that there are multiple angles from which the model needs to be validated, and no individual metric seems to satisfy all of them:
Evaluating overfitting / memorization
These checks should be based on the F1 score (or a similar metric) computed on the model's raw output, as the later FDR process would likely "smooth over" these negative effects.
Many models have built-in overfitting protection, but it's also possible to measure overfitting generally by comparing the model's accuracy between train & validate sets.
As cross-validation will be used for the optimization and evaluation, the existence of overfitting does not compromise the analysis. The value of detecting overfitting is that it signals when the model is performing sub-optimally, i.e. that the validation set's score might be improvable by taking more steps to reduce overfitting.
When there are sparse areas of feature space, decision tree models can start to memorize data points even in early epochs, before the learning curve shows any sign of it.
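A minimal sketch of the train-vs-validation comparison, assuming scikit-learn, an F1 scorer, and folds grouped by dataset to avoid leakage; the model and the synthetic feature matrix are placeholders, not the project's actual setup:

```python
# Minimal sketch: compare train vs. validation F1 across cross-validation
# folds to detect overfitting. The model (gradient-boosted trees), the
# synthetic feature matrix and the per-dataset grouping are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupKFold, cross_validate

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
groups = np.random.default_rng(0).integers(0, 10, size=len(y))  # e.g. dataset IDs

scores = cross_validate(
    GradientBoostingClassifier(),
    X, y,
    groups=groups,
    cv=GroupKFold(n_splits=5),  # group folds by dataset to avoid leakage
    scoring="f1",
    return_train_score=True,
)
gap = scores["train_score"].mean() - scores["test_score"].mean()
print(f"train F1 {scores['train_score'].mean():.3f}, "
      f"validation F1 {scores['test_score'].mean():.3f}, gap {gap:.3f}")
# A large gap suggests the validation score could be improved by stronger
# regularization, more data, or fewer/cleaner features.
```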
Evaluating input bias
IMO this is one of the bigger areas of risk. If the model can't generalize, it can't safely be deployed for general-purpose use across METASPACE.
Evaluating model performance
It's not yet clear which metric will best represent the objective we want to optimize, so try all of these and check the results:
The measure should prioritize both the number of annotations and how low their FDRs are, e.g. 10 annotations at FDR<=5% are preferable to 12 at FDR<=20%.
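As one illustration of such a measure (an assumption, not an agreed choice), annotations could be counted with weights that favour stricter FDR levels, so that 10 annotations at FDR<=5% outscore 12 at FDR<=20%:

```python
# Sketch of one candidate measure (an assumption, not an agreed choice):
# weight annotations more heavily at stricter FDR levels.
from typing import Sequence

FDR_LEVELS = (0.05, 0.10, 0.20, 0.50)  # standard METASPACE FDR levels
WEIGHTS = (1.0, 0.5, 0.25, 0.1)        # hypothetical per-level weights

def weighted_annotation_score(fdrs: Sequence[float]) -> float:
    """Sum over FDR levels of (annotations passing the level) * (level weight)."""
    return sum(
        weight * sum(1 for fdr in fdrs if fdr <= level)
        for level, weight in zip(FDR_LEVELS, WEIGHTS)
    )

print(weighted_annotation_score([0.05] * 10))  # 18.5
print(weighted_annotation_score([0.20] * 12))  # 4.2
```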
Evaluating "Success" / marketing the model to users
The stated objective is "90% of datasets get at least 10% increase of annotations at FDR<=10%". For advertising, we should keep that format and find the most appealing combination of %s.
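A small sketch of checking a statement in that format against per-dataset results; the input dicts (annotation counts at FDR<=10% per dataset, for the old and new scoring) are hypothetical:

```python
# Sketch of checking an objective of the form "X% of datasets get at least a
# Y% increase in annotations at FDR<=10%". The inputs are hypothetical.
from typing import Dict

def fraction_of_improved_datasets(
    baseline_counts: Dict[str, int],
    new_counts: Dict[str, int],
    min_relative_increase: float = 0.10,
) -> float:
    improved = [
        ds_id
        for ds_id, base in baseline_counts.items()
        if base > 0
        and (new_counts.get(ds_id, 0) - base) / base >= min_relative_increase
    ]
    return len(improved) / len(baseline_counts)

# The stated objective is met if this returns >= 0.9; sweeping the two
# percentages gives the "most appealing combination" mentioned above.
```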
Training
Several aspects need to be optimized:
Which model to use
Which features to use
Which model hyperparameters to use
[ ] Lucas suggests using something like grid search for hyperparameter optimization
[ ] Theo suggests trying out two-pass training, where the second pass discards targets of low confidence (e.g. FDR>50%) from the previous round
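A sketch combining the two suggestions above, assuming scikit-learn; the model, the parameter grid, and the use of predicted probability as a stand-in for the decoy-based FDR are all illustrative assumptions:

```python
# Sketch combining grid search (pass 1) with two-pass training (pass 2).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Pass 1: grid search over hyperparameters.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, scoring="f1", cv=5)
search.fit(X, y)

# Pass 2: drop low-confidence targets (here approximated by predicted
# probability; in the real pipeline this would use the FDR>50% cut-off),
# keep all decoys (y == 0), and retrain with the best hyperparameters.
proba = search.best_estimator_.predict_proba(X)[:, 1]
keep = (y == 0) | (proba >= 0.5)
second_pass = GradientBoostingClassifier(**search.best_params_)
second_pass.fit(X[keep], y[keep])
```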
Questions to answer
Before training
During evaluation
Post-evaluation