related-sciences / nxontology-ml

Machine learning to classify ontology nodes
Apache License 2.0

Good metrics for model evaluation? #9

Closed yonromai closed 1 year ago

yonromai commented 1 year ago

Hi!

Context

Before proceeding with more work on modeling & features, I think it would be useful to make sure that we are tracking good metrics to evaluate the output of the models.

In my last PR, I introduced basic canonical multi-class metrics (sklearn's classification_report) computed on the labelled dataset (efo_otar_slim_v3.43.0_rs_classification.tsv):

                    precision    recall  f1-score   support

01-disease-subtype       0.80      0.85      0.82      1123
   02-disease-root       0.72      0.64      0.68       776
   03-disease-area       0.77      0.79      0.78       242
    04-non-disease       0.97      0.99      0.98       918

          accuracy                           0.83      3059
         macro avg       0.82      0.82      0.82      3059
      weighted avg       0.83      0.83      0.83      3059
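
For reference, a minimal sketch of the call that produces this kind of report (the labels below are illustrative placeholders, not the repo's pipeline code):

```python
# Illustrative sketch only: scikit-learn's classification_report on toy labels.
from sklearn.metrics import classification_report

y_true = ["01-disease-subtype", "02-disease-root", "03-disease-area", "04-non-disease"]
y_pred = ["01-disease-subtype", "01-disease-subtype", "03-disease-area", "04-non-disease"]

print(classification_report(y_true, y_pred, zero_division=0))
```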

Question

⇒ Have you done work internally to evaluate classification metrics and how well they correlate to business goals?

eric-czech commented 1 year ago

Have you done work internally to evaluate classification metrics and how well they correlate to business goals?

I'd have a minor preference for ROC, but F1 is ok too (ideally both when not picking one for optimization).

I will say though that I think support-weighted averages are definitely bad here since 03-disease-area is both the least frequent and most important (for us) class to identify correctly. An evenly weighted macro-average is fine for a single score across all classes, though a better weighting IMO is {'03-disease-area': .5, '02-disease-root': .25, '01-disease-subtype': .25}.

The 04-non-disease class should be irrelevant since it's determined by the ontology structure rather than being a part of the training/prediction problem.
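
As a concrete sketch of that weighting (the per-class scores and helper below are illustrative, not code from this repo):

```python
# Illustrative sketch: combine per-class scores (e.g. per-class F1 or ROC AUC)
# with the custom business weights suggested above; 04-non-disease is ignored.
CLASS_WEIGHTS = {
    "03-disease-area": 0.50,
    "02-disease-root": 0.25,
    "01-disease-subtype": 0.25,
}

def weighted_macro_score(per_class_scores: dict) -> float:
    """Weighted average of per-class scores using the fixed weights above."""
    return sum(weight * per_class_scores[cls] for cls, weight in CLASS_WEIGHTS.items())

# Example using the per-class F1 values from the report above:
print(weighted_macro_score({"01-disease-subtype": 0.82, "02-disease-root": 0.68, "03-disease-area": 0.78}))
# 0.5 * 0.78 + 0.25 * 0.68 + 0.25 * 0.82 = 0.765
```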

dhimmel commented 1 year ago

I prefer macro-averaged mean absolute error, where each class could be evenly weighted or, even better, customizable like @eric-czech suggested.

I'm assuming the predicted classes are fuzzy (i.e. probabilities rather than binary) and sum to 1 for a given sample across all classes.

References (all in the context of ordinal classification rather than nominal/categorical classification):

Not sure if the sklearn mean_absolute_error function can work with categorical labels. I've only ever used this metric for binary classification.

Here's how I envision the computation working for a specific sample:

| Class | Truth | Prediction | Abs Error |
| --- | --- | --- | --- |
| 01-disease-subtype | 1 | 0.6 | 0.4 |
| 02-disease-root | 0 | 0.3 | 0.3 |
| 03-disease-area | 0 | 0.1 | 0.1 |
| 04-non-disease | 0 | 0.0 | 0.0 |

Then we average the absolute error across all samples within each class. Then we combine the within-class MAEs using the macro-weights to get the single output score.
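
Here is a rough sketch of how I picture that computation (assuming one-hot truth and predicted probabilities; the weights, helper name, and the use of sklearn's mean_absolute_error are illustrative, not code from this repo):

```python
# Rough sketch only: macro-weighted MAE between one-hot truth and predicted
# probabilities. Weights mirror the 0.5 / 0.25 / 0.25 suggestion above, with
# 04-non-disease given zero weight since it isn't part of the prediction problem.
import numpy as np
from sklearn.metrics import mean_absolute_error

CLASSES = ["01-disease-subtype", "02-disease-root", "03-disease-area", "04-non-disease"]
MACRO_WEIGHTS = np.array([0.25, 0.25, 0.50, 0.00])

def macro_weighted_mae(y_true_onehot: np.ndarray, y_proba: np.ndarray) -> float:
    """Both inputs have shape (n_samples, n_classes), columns ordered as CLASSES."""
    # Per-class MAE, averaged over all samples within each class column.
    per_class_mae = mean_absolute_error(y_true_onehot, y_proba, multioutput="raw_values")
    # Weighted combination into a single score (0 = perfect, 1 = exactly opposite).
    return float(np.dot(MACRO_WEIGHTS, per_class_mae))

# Worked example using the single-sample table above:
y_true = np.array([[1, 0, 0, 0]])
y_proba = np.array([[0.6, 0.3, 0.1, 0.0]])
print(macro_weighted_mae(y_true, y_proba))  # 0.25*0.4 + 0.25*0.3 + 0.5*0.1 ≈ 0.225
```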

The score is between 0 for perfect and 1 for exactly opposite. I like this metric because it accounts for how far off a prediction is, rather than applying only after the predictions are binarized and most of the information is lost. It's especially important if we have a case where we're interested in fuzzy predictions (for this project, perhaps less so). A mamae of 0.1 would indicate that, after our macro-averaging, each prediction is on average 0.1 away from the truth.

I'm not sure why it's not the most common metric for all classification problems. Convince me it's bad haha.

That being said, I leave what metric to use as something that is up to you the implementer.

yonromai commented 1 year ago

Cool, that works for me. I'll give it a spin and open a corresponding PR.

support-weighted averages are definitely bad here since 03-disease-area is both the least frequent and most important (for us) class to identify correctly

Great to know, thanks for the insight! This class imbalance has an interesting impact on the model and its business outcome.

@eric-czech Do you think it'd be worth trying to come up with a somewhat solid estimate for the class weights and inject that into the training objective?

I like this metric because it accounts for how far off a prediction is, rather than applying only after the predictions are binarized and most of the information is lost.

@dhimmel I think this metric makes a lot of sense; I just wonder how correlated it is with a binary error (once averaged over a lot of samples). I don't see how the extra information can hurt, so I'll give it a try.

Also, I noticed that this metric is mostly used for ordinal classification. => Do we have an ordinal relationship between classes here? (e.g. from more to least specific)

I'm not sure why it's not the most common metric for all classification problems. Convince me it's bad haha.

FWIW, I could hypothesize why MAE would be less popular as an objective function than binary-type losses:

But I agree that it should be a stronger metric in the case where the classes are more fuzzy (unless training speed is more valuable than a better objective function?)

eric-czech commented 1 year ago

@eric-czech Do you think it'd be worth trying to come up with a somewhat solid estimate for the class weights and inject that into the training objective?

Yep, I'm on board with that. 50% for disease area and 25% for each of the other two feels right to me.
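
For example, a sketch of passing those weights into training via a scikit-learn-style class_weight argument (the estimator choice is a placeholder, not necessarily the model used here):

```python
# Illustrative sketch: injecting the agreed class weights into the training
# objective via class_weight. 04-non-disease is excluded since it's determined
# by the ontology structure rather than predicted.
from sklearn.linear_model import LogisticRegression

CLASS_WEIGHT = {
    "03-disease-area": 0.50,
    "02-disease-root": 0.25,
    "01-disease-subtype": 0.25,
}

model = LogisticRegression(class_weight=CLASS_WEIGHT, max_iter=1000)
# model.fit(X_train, y_train)  # X_train / y_train would come from the labelled dataset
```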

ordinal classification

That's a great angle here. I always find it a little frustrating how poor the support for that is in most ML libs with canned implementations. If you're interested, it's definitely worth exploring more since it makes interpretability cleaner.

dhimmel commented 1 year ago

Thanks @yonromai for your explanation and thoughts. Regarding the proposed mamae metric, feel free to only use it when reporting on the final models rather than as an objective function.

Do we have an ordinal relationship between classes here? (e.g. from more to least specific)

Hmm. Yes, I think we could consider this ordinal. If a term is truthfully high specificity, calling it low specificity would appear to be a more egregious error than calling it medium specificity. But using a nominal multiclassifier is also okay.
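
To make that concrete, a tiny sketch of the ordinal view (the rank mapping below is just one illustrative encoding):

```python
# Illustrative only: encode the three disease classes as specificity ranks so
# that the size of an error reflects how far apart the classes are.
ORDINAL_RANK = {"01-disease-subtype": 0, "02-disease-root": 1, "03-disease-area": 2}

def rank_error(true_label: str, pred_label: str) -> int:
    """Absolute rank distance between the true and predicted class."""
    return abs(ORDINAL_RANK[true_label] - ORDINAL_RANK[pred_label])

print(rank_error("01-disease-subtype", "03-disease-area"))  # 2: high specificity called low
print(rank_error("01-disease-subtype", "02-disease-root"))  # 1: high specificity called medium
```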

yonromai commented 1 year ago

Based on the experimental results, it seems like the metrics discussed above are good enough for now. I'm going to close this issue. Feel free to re-open if you disagree or have anything else to add.