metaspace2020 / metaspace

Cloud engine and platform for metabolite annotation for imaging mass spectrometry
https://metaspace2020.eu/
Apache License 2.0

ML-based annotation scoring #797

Closed LachlanStuart closed 2 years ago

LachlanStuart commented 3 years ago

~~Depends on #796~~ (Recalibration has advanced enough that this is no longer blocked) Closes #147

Plan:

Algorithm changes

Points marked (non-core) are not a vital part of the project, but may reveal easy wins. They can be skipped if implementation would take too much time.

Pre-annotation

Image/metric sets

For every batch of images, metrics should be calculated for each of the following combinations of images:

Metrics

Derived metrics

(i.e. metrics that can be calculated outside of the annotation pipeline based on the above)

Other pipeline changes

Evaluation

Test datasets

Evaluation criteria

Through discussion it has become clear that there are multiple angles from which the model needs to be validated, and no single metric seems to satisfy all of them:

Evaluating overfitting / memorization

These evaluations should be based on the F1 score (or a similar metric) computed on the model's output, as the later FDR process would likely "smooth over" these negative effects.

Many models have built-in overfitting protection, but overfitting can also be measured directly by comparing the model's accuracy between the train and validation sets.

As cross-validation will be used for the optimization and evaluation, the existence of overfitting does not compromise the analysis. The value of detecting overfitting is that it signals when the model is performing sub-optimally, i.e. that the validation set's score might be improvable by taking more steps to reduce overfitting.

When there are sparse areas of feature space, decision tree models can start to memorize data points even in early training iterations, before the learning curve shows any visible sign of it.
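To make the train-vs-validation comparison concrete, here is a minimal sketch using grouped cross-validation. It assumes scikit-learn-style feature/label arrays, a gradient-boosted tree classifier, and per-dataset grouping of folds; none of these choices are fixed by this issue.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GroupKFold


def train_validation_f1_gap(X, y, groups, n_splits=5):
    """Per-fold (train F1, validation F1); a persistently large gap suggests overfitting."""
    gaps = []
    for train_idx, val_idx in GroupKFold(n_splits=n_splits).split(X, y, groups):
        # Fit on the training folds only, then score both sides with the same metric.
        model = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])
        train_f1 = f1_score(y[train_idx], model.predict(X[train_idx]))
        val_f1 = f1_score(y[val_idx], model.predict(X[val_idx]))
        gaps.append((train_f1, val_f1))
    return gaps
```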

Evaluating input bias

IMO this is one of the bigger areas of risk. If the model can't generalize, it can't safely be deployed for general-purpose use across METASPACE.

Evaluating model performance

It's not yet clear which metric will best represent the objective we want to optimize, so try all of these and check the results:

The measure should prioritize both the number of annotations and how low their FDRs are, e.g. 10 annotations at FDR <= 5% are preferable to 12 at FDR <= 20%.
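One illustrative candidate (the FDR levels and weights below are placeholders, not values agreed in this issue) is a weighted annotation count that rewards annotations more the tighter the FDR level they pass:

```python
# Hypothetical weights per FDR level; lower FDR counts for more.
FDR_WEIGHTS = {0.05: 1.0, 0.10: 0.5, 0.20: 0.25, 0.50: 0.0}


def weighted_annotation_score(fdrs):
    """`fdrs` is an iterable of per-annotation FDR values in the range 0.0-1.0."""
    score = 0.0
    for fdr in fdrs:
        # Use the weight of the tightest FDR level this annotation passes
        # (weights decrease as the level loosens, so that's the maximum weight).
        passing = [w for level, w in FDR_WEIGHTS.items() if fdr <= level]
        if passing:
            score += max(passing)
    return score


# With these weights, 10 annotations at 5% beat 12 annotations at 20%:
assert weighted_annotation_score([0.05] * 10) > weighted_annotation_score([0.20] * 12)
```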

Evaluating "Success" / marketing the model to users

The stated objective is "90% of datasets get at least a 10% increase in annotations at FDR <= 10%". For advertising, we should keep that format and find the most appealing combination of %s.
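A minimal sketch of checking that claim over a collection of datasets; `baseline` and `new` are hypothetical dicts mapping dataset id to the number of annotations at FDR <= 10% before and after the new scoring (the names and data layout are assumptions for illustration):

```python
def fraction_meeting_objective(baseline, new, min_increase=0.10):
    """Fraction of datasets whose annotation count grew by at least `min_increase`."""
    improved = [
        ds for ds, n_before in baseline.items()
        if n_before > 0 and (new.get(ds, 0) - n_before) / n_before >= min_increase
    ]
    return len(improved) / len(baseline)


# The stated objective holds if this is >= 0.9 for min_increase=0.10 at FDR <= 10%.
```

Sweeping the FDR threshold and `min_increase` over a small grid would then show which combination of %s sounds best while still holding for ~90% of datasets.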

Training

Several aspects need to be optimized:

Questions to answer

Before training

During evaluation

Post-evaluation

LachlanStuart commented 3 years ago

@intsco I think you clicked the wrong button and closed this issue. If you're getting notification spam, unsubscribe with the button on the right sidebar:

(screenshot of the unsubscribe button in the issue's right sidebar)

LachlanStuart commented 2 years ago

The majority of this was implemented in #1019.

I've split the remaining coding work into a separate task (#1062) so that this huge issue can be closed.