@intsco I think you clicked the wrong button and closed this issue. If you're getting notification spam, unsubscribe with the button on the right sidebar.
The majority of this was implemented in #1019.
I've split the remaining coding work into a separate task (#1062) so that this huge issue can be closed.
~Depends on #796~ (recalibration has advanced enough that this is no longer blocked). Closes #147.
Plan:
Algorithm changes
Points marked (non-core) are not a vital part of the project, but may reveal easy wins. They can be skipped if implementation would take too much time.
Pre-annotation
Decoys / isotope generation
Image/metric sets
For every batch of images for which metrics will be calculated, the following combinations of images should have metrics calculated:
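Whatever the final set of combinations, the batching pattern is the same. A minimal sketch, assuming a generic list of named-image combinations and a dict of metric functions (all names below are hypothetical):

```python
# Minimal sketch of the batching pattern: compute every metric for every
# image combination in a batch. `images`, `combinations` and `metric_fns`
# are hypothetical names; the real pipeline defines which combinations are used.
from typing import Callable, Dict, List, Sequence, Tuple

import numpy as np

def compute_metrics_for_batch(
    images: Dict[str, np.ndarray],
    combinations: Sequence[Tuple[str, ...]],
    metric_fns: Dict[str, Callable[..., float]],
) -> List[dict]:
    results = []
    for combo in combinations:
        imgs = [images[name] for name in combo]
        row = {"combination": combo}
        row.update({metric: fn(*imgs) for metric, fn in metric_fns.items()})
        results.append(row)
    return results
```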
Metrics
Spectral
Spatial
Chaos
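For context on the two items below, here is a rough sketch of a spatial-chaos metric built from the `nlevels` and `label` pieces they mention: threshold the ion image at `nlevels` intensity levels and run connected-component labelling at each level. It illustrates why `label` dominates the cost and why reducing `nlevels` is a cheap speed-up; it is not the actual implementation.

```python
# Rough sketch of an nlevels-based spatial-chaos metric. `scipy.ndimage.label`
# runs once per intensity level, so reducing `nlevels` directly cuts runtime.
# This is an illustration, not the actual METASPACE implementation.
import numpy as np
from scipy import ndimage

def spatial_chaos_sketch(img: np.ndarray, nlevels: int = 30) -> float:
    nonzero = np.count_nonzero(img)
    if nonzero == 0:
        return 0.0
    levels = np.linspace(0, img.max(), nlevels, endpoint=False)
    object_counts = []
    for level in levels:
        # The expensive step, repeated nlevels times
        _, num_objects = ndimage.label(img > level)
        object_counts.append(num_objects)
    # Fewer objects per non-zero pixel => a cleaner, less "chaotic" image
    return float(1 - np.mean(object_counts) / nonzero)
```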
[ ] `nlevels` can be reduced
[ ] `label` functions are too slow. I've also added a number of potential improvements, each as separate metrics so that they can be trialed on a larger scale:

Other stats
Derived metrics
(i.e. metrics that can be calculated outside of the annotation pipeline based on the above)
Other pipeline changes
Evaluation
Test datasets
Evaluation criteria
Through discussion it has become clear that there are multiple angles from which the model needs to be validated, and no individual metric seems to satisfy all of them:
Evaluating overfitting / memorization
These checks should be based on the F1 score (or a similar metric) computed on the model's raw output, as the later FDR process would likely "smooth over" these negative effects.
Many models have built-in overfitting protection, but it's also possible to measure overfitting generally by comparing the model's accuracy between train & validate sets.
As cross-validation will be used for the optimization and evaluation, the existence of overfitting does not compromise the analysis. The value of detecting overfitting is that it signals when the model is performing sub-optimally, i.e. that the validation set's score might be improvable by taking more steps to reduce overfitting.
When there are sparse areas of feature space, decision tree models can start to memorize data points even in early epochs, before the learning curve shows any sign of it.
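A minimal sketch of the train-vs-validation comparison, assuming scikit-learn, an F1 scorer, and folds grouped by dataset to avoid leakage; the model and the synthetic feature matrix are placeholders, not the project's actual setup:

```python
# Minimal sketch: compare train vs. validation F1 across cross-validation
# folds to detect overfitting. The model (gradient-boosted trees), the
# synthetic feature matrix and the per-dataset grouping are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupKFold, cross_validate

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
groups = np.random.default_rng(0).integers(0, 10, size=len(y))  # e.g. dataset IDs

scores = cross_validate(
    GradientBoostingClassifier(),
    X, y,
    groups=groups,
    cv=GroupKFold(n_splits=5),  # group folds by dataset to avoid leakage
    scoring="f1",
    return_train_score=True,
)
gap = scores["train_score"].mean() - scores["test_score"].mean()
print(f"train F1 {scores['train_score'].mean():.3f}, "
      f"validation F1 {scores['test_score'].mean():.3f}, gap {gap:.3f}")
# A large gap suggests the validation score could be improved by stronger
# regularization, more data, or fewer/cleaner features.
```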
Evaluating input bias
IMO this is one of the bigger areas of risk. If the model can't generalize, it can't safely be deployed for general-purpose use across METASPACE.
Evaluating model performance
It's not yet clear which metric will best represent the objective we want to optimize, so try all of these and check the results:
The measure should prioritize both the number of annotations and how low their FDRs are, e.g. 10 annotations at FDR<=5% are preferable to 12 at FDR<=20%.
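As one illustration of such a measure (an assumption, not an agreed choice), annotations could be counted with weights that favour stricter FDR levels, so that 10 annotations at FDR<=5% outscore 12 at FDR<=20%:

```python
# Sketch of one candidate measure (an assumption, not an agreed choice):
# weight annotations more heavily at stricter FDR levels.
from typing import Sequence

FDR_LEVELS = (0.05, 0.10, 0.20, 0.50)  # standard METASPACE FDR levels
WEIGHTS = (1.0, 0.5, 0.25, 0.1)        # hypothetical per-level weights

def weighted_annotation_score(fdrs: Sequence[float]) -> float:
    """Sum over FDR levels of (annotations passing the level) * (level weight)."""
    return sum(
        weight * sum(1 for fdr in fdrs if fdr <= level)
        for level, weight in zip(FDR_LEVELS, WEIGHTS)
    )

print(weighted_annotation_score([0.05] * 10))  # 18.5
print(weighted_annotation_score([0.20] * 12))  # 4.2
```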
Evaluating "Success" / marketing the model to users
The stated objective is "90% of datasets get at least 10% increase of annotations at FDR<=10%". For advertising, we should keep that format and find the most appealing combination of %s.
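A small sketch of checking a statement in that format against per-dataset results; the input dicts (annotation counts at FDR<=10% per dataset, for the old and new scoring) are hypothetical:

```python
# Sketch of checking an objective of the form "X% of datasets get at least a
# Y% increase in annotations at FDR<=10%". The inputs are hypothetical.
from typing import Dict

def fraction_of_improved_datasets(
    baseline_counts: Dict[str, int],
    new_counts: Dict[str, int],
    min_relative_increase: float = 0.10,
) -> float:
    improved = [
        ds_id
        for ds_id, base in baseline_counts.items()
        if base > 0
        and (new_counts.get(ds_id, 0) - base) / base >= min_relative_increase
    ]
    return len(improved) / len(baseline_counts)

# The stated objective is met if this returns >= 0.9; sweeping the two
# percentages gives the "most appealing combination" mentioned above.
```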
Training
Several aspects need to be optimized:
Which model to use
Which features to use
Which model hyperparameters to use
[ ] Lucas suggests using something like grid search for hyperparameter optimization
[ ] Theo suggests trying out two-pass training, where the second pass discards targets of low confidence (e.g. FDR>50%) from the previous round
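A sketch combining the two suggestions above, assuming scikit-learn; the model, the parameter grid, and the use of predicted probability as a stand-in for the decoy-based FDR are all illustrative assumptions:

```python
# Sketch combining grid search (pass 1) with two-pass training (pass 2).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Pass 1: grid search over hyperparameters.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, scoring="f1", cv=5)
search.fit(X, y)

# Pass 2: drop low-confidence targets (here approximated by predicted
# probability; in the real pipeline this would use the FDR>50% cut-off),
# keep all decoys (y == 0), and retrain with the best hyperparameters.
proba = search.best_estimator_.predict_proba(X)[:, 1]
keep = (y == 0) | (proba >= 0.5)
second_pass = GradientBoostingClassifier(**search.best_params_)
second_pass.fit(X[keep], y[keep])
```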
Questions to answer
Before training
During evaluation
Post-evaluation