RaphaelS1 opened 4 years ago
First a comment about your suggestion: I think it makes a lot of sense, since the individual predictions, loss evaluations, and/or residuals are needed for some post-hoc analyses. Even if this is not directly called by the user, there should be some way to get these via the measure rather than computing them manually.
There are, though, two issues I think we should discuss:
Second, I think there is an interesting distinction here, which we thought about a little when designing mlaut (which is a bit like mlr3benchmark for Python); see also the paper about it.
The problem is: some measures do not first compute individual losses/utilities and then aggregate. For example, classification AUROC, the concordance index, or F1 cannot be written as individual sample losses combined by an aggregation function; and for sensitivity/specificity or RMSE the aggregation function is odd.
Conceptually, you have aggregate measures (which take all test predictions and observations as input) and sample-level measures (which take a single prediction and observation as input). Formally, you can apply a compositor (the aggregation mode) to the latter to create one of the former.
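To make the distinction concrete, here is a minimal base-R sketch (hypothetical names, not mlr3 API) of a compositor turning a sample-level measure into an aggregate one; RMSE shows why the aggregation can be "odd":

```r
# Minimal sketch (hypothetical names, not mlr3 API): a sample-level measure
# maps per-observation (truth, prediction) pairs to numbers; a compositor
# turns it into an aggregate measure over the whole test set.
pointwise_squared_error = function(truth, response) {
  (truth - response)^2
}

compose_aggregate = function(pointwise, aggregator = mean) {
  function(truth, response) {
    aggregator(pointwise(truth, response))
  }
}

# mse is a plain sample average; rmse is not, because the sqrt sits
# outside the mean -- this is the "odd" aggregation mentioned above.
mse  = compose_aggregate(pointwise_squared_error, mean)
rmse = compose_aggregate(pointwise_squared_error, function(x) sqrt(mean(x)))

rmse(truth = c(1, 2, 3), response = c(1.1, 1.8, 3.4))
```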
One question then is: since this is mathematically different, should that not be two different kinds of object?
What would you do if you wanted to return two closely related aggregations together, e.g., RMSE together with a standard error estimate? I think (though one might disagree) that the user should be able to provide their own function and have it wrapped as an aggregator.
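A minimal sketch of that idea, with made-up names (not mlr3 API) and per-observation squared errors as input:

```r
# Sketch (made-up names, not mlr3 API): the user supplies one function
# returning two closely related aggregates, wrapped as an aggregator.
rmse_with_se = function(se2) {  # se2: per-observation squared errors
  c(rmse = sqrt(mean(se2)),
    # standard error of the mean squared error, as a companion statistic
    # (note: not the standard error of the RMSE itself)
    se = sd(se2) / sqrt(length(se2)))
}

rmse_with_se(c(0.2, 1.3, 0.5, 0.9))  # named vector: rmse and se together
```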
Each measure contains an `aggregator` field where users can supply their own aggregator (https://mlr3.mlr-org.com/reference/Measure.html). However, this is not an aggregation on the level of individual predictions but an aggregation across folds.
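For illustration, an example with standard mlr3 objects; that the `aggregator` field can simply be overwritten like this is an assumption based on the reference page:

```r
# Illustration with standard mlr3 objects; that the aggregator field is
# writable like this is an assumption based on the reference page.
library(mlr3)

task    = tsk("sonar")
learner = lrn("classif.rpart", predict_type = "prob")
measure = msr("classif.logloss")
measure$aggregator = median  # across-folds aggregation: median instead of mean

rr = resample(task, learner, rsmp("cv", folds = 3))
rr$aggregate(measure)  # single number: median of the three per-fold loglosses
```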
It would be relatively easy to allow measures to return (numeric) vectors and introduce a second aggregation function operating on lists of such vectors. Would that help?
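A sketch of what that could look like, with made-up per-fold loss vectors and a hypothetical second-level aggregation function:

```r
# Sketch: measures return per-observation numeric vectors (made-up data),
# and a hypothetical second-level function aggregates the list of vectors.
per_fold = list(
  c(0.12, 0.40, 0.08),  # pointwise losses, fold 1
  c(0.25, 0.10, 0.31)   # pointwise losses, fold 2
)

aggregate_folds = function(losses, fun = function(x) {
  c(mean = mean(x), se = sd(x) / sqrt(length(x)))
}) {
  fun(unlist(losses))  # pool all observations, then aggregate once
}

aggregate_folds(per_fold)  # mean loss plus its standard error
```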
I assume it's easy to implement this as a possible return type, but not to actually go back and change all implemented measures, which currently return aggregated scores automatically?
Currently there are two scoring methods in `BenchmarkResult`. For some `BenchmarkResult` object called `bmr`:

- `bmr$score()` - Returns the aggregated scores for every fold
- `bmr$aggregate()` - Returns the aggregated scores aggregated over each fold

The problem is that currently no mid-point is supported due to how measures in `mlr3measures` are implemented, e.g. the final line of logloss: the mean is hardcoded into the equation. This is generally a problem and doesn't allow easy support for standard errors or examining residuals for an individual prediction.
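For illustration, a paraphrase of that pattern (not the verbatim `mlr3measures` source):

```r
# Paraphrase of the pattern (not the verbatim mlr3measures source): the
# pointwise losses -log(p) are averaged in place, so the per-observation
# values are never exposed to the caller.
logloss = function(truth, prob, eps = 1e-15) {
  # probability assigned to the observed class of each row
  p = prob[cbind(seq_along(truth), match(truth, colnames(prob)))]
  mean(-log(pmax(p, eps)))  # <- the mean is hardcoded
}
```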
Therefore this issue would depend on a restructuring of scores. One suggestion would be as follows, using logloss as an example:
Have a `classif.logloss` class with three methods:

- `score` - Returns `-log(p)`
- `aggr` - Returns `mean(self$score())`
- `se` - Returns `sd(self$score())/sqrt(task$nrow)`
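A minimal R6 sketch of this proposal (hypothetical class, not the mlr3 `Measure` API; `length(self$p)` stands in for `task$nrow`):

```r
# Minimal R6 sketch of the proposal (hypothetical class, not the mlr3
# Measure API; length(self$p) stands in for task$nrow).
library(R6)

ClassifLogloss = R6Class("ClassifLogloss",
  public = list(
    p = NULL,  # probabilities assigned to the observed classes
    initialize = function(p) {
      self$p = p
    },
    score = function() -log(self$p),                            # per-observation losses
    aggr  = function() mean(self$score()),                      # aggregated score
    se    = function() sd(self$score()) / sqrt(length(self$p))  # standard error
  )
)

m = ClassifLogloss$new(p = c(0.9, 0.6, 0.8))
m$score()  # residual-level values
m$aggr()   # mean logloss (what measures return today)
m$se()     # standard error of the mean
```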
Alternatively, something like a class with one method with options (sketched at the end of this comment):

- `self$score(type = "resid")` - Returns `-log(p)` (or maybe `"response"`)
- `self$score(type = "aggr")` - Returns `mean(self$score())` (this would be the default)
- `self$score(type = "se")` - Returns `sd(self$score())/sqrt(task$nrow)`

Anyway, this is outside the remit of this package, but it would be required for Wilcoxon pairwise tests and other comparisons that look at all residuals. And it's a decision that probably has to be made by @mllg or @berndbischl.
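And a sketch of that single-method alternative, dispatching on `type` (hypothetical API, written as a closure so the example is self-contained):

```r
# Sketch of the single-method alternative (hypothetical API): one score()
# dispatching on `type`; closure-based so the example runs standalone.
make_logloss = function(p) {
  function(type = "aggr") {
    resid = -log(p)
    switch(type,
      resid = resid,                            # per-observation losses
      aggr  = mean(resid),                      # default: aggregated score
      se    = sd(resid) / sqrt(length(resid)),  # standard error of the mean
      stop("unknown type: ", type)
    )
  }
}

score = make_logloss(p = c(0.9, 0.6, 0.8))
score("resid")  # residuals, e.g. for Wilcoxon pairwise tests
score()         # aggregated (default)
score("se")     # standard error
```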