mlr-org / mlr3benchmark

Analysis and tools for benchmarking in mlr3 and beyond.
https://mlr3benchmark.mlr-org.com/
GNU Lesser General Public License v3.0

Implement a `half-way' score method or dispatch for BenchmarkResult #1

Open RaphaelS1 opened 4 years ago

RaphaelS1 commented 4 years ago

Currently there are two scoring methods in BenchmarkResult:

For some BenchmarkResult object called bmr:

  1. bmr$score() - Returns one aggregated score per resampling iteration (fold)
  2. bmr$aggregate() - Returns these per-fold scores aggregated over all folds

The problem is that no mid-point between these is currently supported, due to how measures in mlr3measures are implemented. E.g. the final line of logloss:

-mean(log(p))

The mean is hardcoded into the equation. This is a general problem: it prevents easy support for standard errors and for examining the residual of an individual prediction.

Therefore this issue would depend on a restructuring of scores. One suggestion would be as follows, using logloss as example:

Have a classif.logloss class with three methods:

Alternatively, something like a class with one method with options:
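The two lists above appear to have been lost from the original post, but the gist of the suggestion can be sketched in base R. All names here (`make_logloss`, `sample_loss`, `aggregate`, `score`) are illustrative, not an actual mlr3measures API:

```r
# Sketch: a log loss "measure" that exposes per-sample losses and the
# aggregation step separately, instead of hardcoding mean() into the score.
# Names are illustrative and not part of mlr3measures.
make_logloss <- function(eps = 1e-15) {
  sample_loss <- function(p) {
    # p: predicted probability assigned to the true class, one per observation
    -log(pmax(pmin(p, 1 - eps), eps))
  }
  aggregate <- function(losses, aggregator = mean) aggregator(losses)
  score <- function(p, aggregator = mean) aggregate(sample_loss(p), aggregator)
  list(sample_loss = sample_loss, aggregate = aggregate, score = score)
}

ll <- make_logloss()
p <- c(0.9, 0.8, 0.4)             # probabilities of the observed class
ll$sample_loss(p)                 # residual-level losses, e.g. for rank tests
ll$score(p)                       # equivalent to the current -mean(log(p))
ll$score(p, aggregator = median)  # or any other user-chosen aggregator
```

With the per-sample losses accessible, standard errors or pairwise rank tests over residuals become straightforward, while the default `score()` reproduces the current behaviour.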

Anyway, this is outside the remit of this package but would be required for Wilcoxon pairwise tests and other comparisons that look at all residuals. It is also a decision that probably has to be made by @mllg or @berndbischl

fkiraly commented 4 years ago

First, a comment about your suggestion: I think it makes a lot of sense, since the individual predictions, loss evaluations, and/or residuals are needed for some post-hoc analyses. Even if this is not directly called by the user, there should be some way to get these via the measure rather than manually.

There are, though, two issues I think we should discuss:

fkiraly commented 4 years ago

Second, I think there is an interesting distinction which we thought a little about with mlaut (which is a little like mlr3benchmark for python), see also the paper about it.

The problem is: some measures do not first compute individual losses/utilities and then aggregate. For example, classification AUROC, the concordance index, or F1 cannot be written as an aggregation function applied to individual samples; and for sensitivity/specificity or RMSE the aggregation function is odd.

Conceptually, you have aggregate measures (which take the test predictions and observations as input) and sample-level measures (which take a single prediction and observation as input). Formally, you can apply a compositor (the aggregation mode) to the latter to create one of the former.

One question then is: since this is mathematically different, should that not be two different kinds of object?
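The compositor idea can be sketched in base R as a higher-order function. Names here (`compose_measure`, `se_loss`) are illustrative, not an existing API:

```r
# Sketch of the compositor idea: a sample-level loss plus an aggregation
# mode yields an aggregate measure. Genuinely aggregate measures (AUROC,
# F1, ...) would instead be defined directly on the full prediction set.
compose_measure <- function(sample_loss, aggregator = mean) {
  function(truth, response) aggregator(sample_loss(truth, response))
}

# squared error as a sample-level loss
se_loss <- function(truth, response) (truth - response)^2

mse  <- compose_measure(se_loss)                             # plain mean
rmse <- compose_measure(se_loss, function(x) sqrt(mean(x)))  # the "odd" aggregation

truth    <- c(1, 2, 3)
response <- c(1.5, 2, 2)
mse(truth, response)
rmse(truth, response)
```

This makes the mathematical distinction concrete: MSE factors cleanly into a sample-level loss plus `mean`, RMSE needs a non-standard aggregator, and AUROC cannot be expressed in this form at all.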

RaphaelS1 commented 4 years ago

> what would you do if you would like to return two closely related aggregations together, e.g., return RMSE together with a standard error estimate? I think (but one might disagree) that the user should be able to provide their own function and have it wrapped as an aggregator.

Each measure contains an aggregator field where users can supply their own aggregator (https://mlr3.mlr-org.com/reference/Measure.html). However, this aggregation operates across folds, not at the level of individual predictions.

mllg commented 4 years ago

It would be relatively easy to allow measures to return (numeric) vectors and introduce a second aggregation function operating on lists of such vectors. Would that help?

RaphaelS1 commented 4 years ago

I assume it's easy to implement this as a possible return type, but not to actually go back and change all implemented measures, which currently return aggregated scores automatically?