mattwarkentin / ordered

Bindings for ordinal classification models for use with the 'parsnip' package, such as ordinal random forests by Hornung R. (2020) <doi:10.1007/s00357-018-9302-x> and others.

enable numeric and ordinal metrics for ordinal outcome models #7

Open corybrunson opened 3 weeks ago

corybrunson commented 3 weeks ago

Recently, Sakai (2021) compared several class, numeric, and proposed "ordinal" performance measures on ordinal classification tasks. This raises the questions of (1) which performance measures {yardstick} should make available for ordinal classification models and (2) how to harmonize that decision with package conventions. I don't know what challenges (2) would pose, and in any case they will depend on (1).

I think it's necessary to make measures available that are specifically designed for ordinal classification, in part because there are serious, though separate, theoretical problems with using class and numeric measures. That said, I think there are also good reasons to make both class and numeric measures available (see the sketch after this list):

  1. Commensurability: Compare results to previous work that used class or numeric measures.
  2. Benchmarking: Measure the comparative advantage of using ordinal measures.
  3. Model selection: Assess whether a nominally ordinal outcome can be treated as categorical or integer-valued (for reasons, e.g., of tractability or interpretation).
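For illustration, here is a minimal sketch of scoring the same predictions with class metrics and then with numeric metrics after coercing the ordered levels to integer scores. The `preds` tibble and its column names are hypothetical, and the equidistant 1..K scoring is just one convention, not a recommendation:

```r
library(yardstick)
library(dplyr)

# `preds` is a hypothetical tibble of held-out predictions with an ordered
# factor truth column `pain` and a hard class prediction `.pred_class`.
class_metrics <- metric_set(accuracy, kap)
class_metrics(preds, truth = pain, estimate = .pred_class)

# Numeric metrics after coercing the ordered levels to integer scores
# (1, 2, ..., K); this treats adjacent levels as equidistant.
preds_num <- preds %>%
  mutate(
    pain_num  = as.integer(pain),
    .pred_num = as.integer(.pred_class)
  )
numeric_metrics <- metric_set(rmse, mae)
numeric_metrics(preds_num, truth = pain_num, estimate = .pred_num)
```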

Because metric_set() (understandably) refuses to mix numeric and class measures, perhaps this would be best achieved by allowing ordinal_reg(), its engines, and other ordinal engines to also operate in 'regression' mode, while the specifically ordinal measures could require (erroring otherwise) or expect (warning otherwise) that the outcome is ordered, that the model type or engine is ordinal, or that some other check passes.
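As a sketch of the kind of check described above, an ordinal measure could refuse to run unless the outcome is an ordered factor. This is not an existing yardstick API, just a hypothetical function:

```r
# Hypothetical ordinal metric: mean absolute distance between level indices.
# Errors if the outcome is not ordered, guarding against plain nominal factors.
ordinal_mae <- function(truth, estimate) {
  if (!is.ordered(truth)) {
    stop("`truth` must be an ordered factor for ordinal metrics.")
  }
  if (!identical(levels(truth), levels(estimate))) {
    stop("`truth` and `estimate` must share the same levels.")
  }
  mean(abs(as.integer(truth) - as.integer(estimate)))
}
```

Wiring something like this into metric_set() would go through yardstick's metric constructors (e.g. new_class_metric()), which is where question (2) about package conventions comes in.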

This would unavoidably enable bad practice, but it's bound to come up, and I think it deserves consideration.

topepo commented 2 weeks ago

These should all be in yardstick. I've made an issue for ranked probability scores, which I favor.
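For reference, the discrete ranked probability score compares cumulative predicted probabilities with the cumulative indicator of the observed level. A minimal sketch (not yardstick's implementation; the division by K - 1 is one common normalization):

```r
# `prob` : an n x K matrix of class probabilities, columns in level order.
# `truth`: an ordered factor of length n with K levels.
ranked_prob_score <- function(prob, truth) {
  K <- nlevels(truth)
  cum_pred <- t(apply(prob, 1, cumsum))          # cumulative predicted probs
  cum_obs  <- t(sapply(as.integer(truth),        # cumulative observed indicator
                       function(k) as.numeric(seq_len(K) >= k)))
  mean(rowSums((cum_pred - cum_obs)^2) / (K - 1))
}
```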

I've read the Sakai paper(s), and they seem to think that probabilistic predictions do not exist.

TBH, everything else that I've seen is problematic in a variety of ways. MSE/MAE/RMSE based on predicted class "distances" are things that we can estimate, but I would not want to rely on them. If we use a class-based metric, I would choose Kappa or alpha or one of the others that have been studied and vetted for decades.
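One ordinal-aware example of those vetted options is weighted kappa, which (to my knowledge) yardstick's kap() already exposes through its weighting argument; the data and column names below are hypothetical:

```r
library(yardstick)

# Quadratically weighted kappa: misclassifications farther from the observed
# ordered level are penalized more heavily than adjacent ones.
kap(preds, truth = pain, estimate = .pred_class, weighting = "quadratic")
```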

A lot of the metrics I see in the CS papers seem poorly motivated, and I get the sense that they've never looked into the massive amounts of prior art on the subject.