[Feature Request] Allow the user to collect_metrics exactly instead of simply averaging

BenoitLondon commented 1 year ago

copied from https://community.rstudio.com/t/averaging-metrics-on-different-size-samples-is-wrong/159199

When ranking the models with tune package, the metrics are averaged over the samples assuming equal size (and linearity).

That's ok for equal size slices and linear metrics like MSE or MAE but it s dubious for unequal sizes and/or non linear metrics like AUC.

Say I do time-walk forward splits by month and my last slice of data has much less data in the validation set, it will benefit to the lucky models that performed well on this last data instead of properly weighting by the slice size for example.

I think the averages computed by collect_metrics should be at least weighted by the size of the sample. (at best recomputed from the predictions over the union of all validation sets.)

A direct improvement would be to weight the metrics by the slice validation sizes, that would at least fix the linear metric case like mse and mae, but for the non-linear metrics we could be allowed to recompute them on all the predictions from the validation sets with some control parameter for example like use_exact_summary = FALSE by default, which when TRUE would recompute the metric on the full validation set.

EmilHvitfeldt commented 1 year ago

Hello @BenoitLondon 👋 Thank you for your interest in this project!

Would you mind describing in a little more detail how you imagine that one would weight the metrics based on slice size. I would point out that the averaging being done on the metric is across the resamples.

BenoitLondon commented 1 year ago

Hi @EmilHvitfeldt!

Let me link two web pages about the issue First: this overall definition of cv error

 When you finish fitting and scoring for all k versions of the training and validation data sets,
 you will obtain holdout predictions for all of the observations in your original training data. 
The average squared error between these predictions and the true observed response is the cross validation error.

And secondly: an example for calculating overall error in k-fold cross validation

Here it is quickly explained that simple average works only for equal sizes and linear metrics like mse/mae/logscore etc For example, in the case of RMSE (which is not linear!) you would need to compute the RMSE of the RMSE's to recover the full RMSE! (let's call this poolable) With more complex metrics like AUC, I doubt there is a way to get the total AUC from the individual folds AUC (let's call this non-poolable), so in this case keeping all predictions would be needed in the exact case I mention above.

So to summarize and answer your question, there are 5 cases I guess:

1) equal size and linear (mae/mse/...) -> simple averaging is correct 2) equal size and poolable (i.e can be transformed to the linear case) (e.g rmse) -> simple averaging with the correct transformation of each fold metric 3) non equal fold size and linear (mae/mse/...) -> weighted averaging is correct (with the weight the proportion of cases in that fold compared to the total size of the validation set) 4) non equal size and poolable (rmse) -> weighted averaging with the correct transformation of each fold metric 5) non-poolable metric (AUC?) -> need full predictions on each fold to compute correct overall metric, averaging is incorrect and can lead to wrong ordering of models especially if non equal size

Note 1: handling case weights would be handy too. Note 2: keeping predictions and compute the overall metric works in all cases and so is probably my prefered solution.

BenoitLondon commented 1 year ago

Another better explanation is there https://docs.h2o.ai/h2o/latest-stable/h2o-docs/cross-validation.html

BenoitLondon commented 1 year ago

Actually, I can see reasons where averaging is useful compared to full computation on all validation predictions:

1) models are differents on each validation set so it doesn't make real sense to pool predictions together before computing the metric 2) it allows to compute standard deviation of the metric which in turns allow to choose a more conservative value of the hyper parameters (like lambda.1se in glmnet for example)

Though the different sizes are still an issue.

Also in the extreme case of leave out one cross validation with metric AUC, averaging makes no sense as all metrics would be 1 for any model!

topepo commented 1 year ago

Is this a reasonable thing to ask for? Yes, I think that we can easily implement this.

However:

it s dubious for unequal sizes

Yes, if you are using sliding windows and some windows are disproportionally small compared to others, this is a sensible thing to be able to do.

However, your comment is untrue as a blanket statement.

For example, almost every theoretical reference that shows the overall resmapling estimate defines it for a general loss function. There isn't any issue with linear or nonlinear metrics. For example, ESL (2009, printing 12) has

for cross-validation and

for the bootstrap. The latter is likely to have the most variation in the number of holdout predictions.

If these estimators required weighting, the 1/N terms would have to come inside of the left-most summations.

As another data point, Efron and Hastie (2021, section 12.2) show an example of using the bootstrap with equal weighting and a nonlinear metric (R²). The creator of the bootstrap doesn't appear to have any issues here.

Again, it is nice to have as a software feature but equal averaging does not constitute an error across all resampling methods.

We will not be making changes to the code to compute metrics based on pooling across holdout sets. You can use the tidyposterior package to get statistically valid estimates using Bayesian partial pooling but, otherwise, we are going to default to computing individual resampling estimates and averaging them.

Also in the extreme case of leave out one cross validation with metric AUC, averaging makes no sense as all metrics would be 1 for any model!

We deliberately do not enable LOOCV as an option that can be used in tidymodels generally. For example, the functions in the tune package do not allow the use of LOOCV. That is because of the poor statistical properties of those estimates (which are not averages of N single point metrics).

So, to be clear, we can add a feature to collect_metrics(weighted = TRUE) that will use the size of each holdout set when computing overall resampling estimates. Here's a reprex illustrating it:

library(tidymodels)


tidymodels_prefer()
theme_set(theme_bw())
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)


set.seed(121)
train_dat <- sim_regression(200)
rs <- bootstraps(train_dat, times = 50) 

ls_res <- 
  linear_reg() %>% 
  fit_resamples(outcome ~ ., resamples = rs)

collect_metrics(ls_res)
#> # A tibble: 2 × 6
#>   .metric .estimator    mean     n std_err .config             
#>   <chr>   <chr>        <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   21.2       50 0.313   Preprocessor1_Model1
#> 2 rsq     standard    0.0958    50 0.00714 Preprocessor1_Model1

# weighted: 
ls_res %>% 
  mutate(
    n_holdout = map_int(splits, ~ nrow(assessment(.x))),
    weights = 1 / n_holdout,
  ) %>% 
  select(.metrics, n_holdout, weights) %>% 
  unnest(cols = .metrics) %>% 
  group_by(.metric) %>% 
  summarize(
    wt_estimate = weighted.mean(.estimate, weights),
    .groups = "drop"
  )
#> # A tibble: 2 × 2
#>   .metric wt_estimate
#>   <chr>         <dbl>
#> 1 rmse        21.2   
#> 2 rsq          0.0955

^{Created on 2023-02-07 by the reprex package (v2.0.1)}

BenoitLondon commented 1 year ago

Thanks Max (@topepo) for taking the time to reply, it all makes sense and as I said earlier I think it 's useful to compute the statistics of the loss across the folds (mean and sdev). Just in my case, I had a very small fold and I needed to handle the size with the weight, the issue being that if I use integrated code (where I cant modify the call to collect_metrics like workflow tune etc) it gives me potentially wrong answers.

simonpcouch commented 1 year ago

In light of the above discussion, I'm going to go ahead and close. This thread will continue to be indexed in search, so if we see an uptick in 👍 in the future we can revisit this discussion. Thanks, yall!

github-actions[bot] commented 1 year ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

tidymodels / tune

[Feature Request] Allow the user to collect_metrics exactly instead of simply averaging #607