kamilest / meds-evaluation


Uncertainty estimate #3

Open Jeanselme opened 3 weeks ago

Jeanselme commented 3 weeks ago

Do we want bootstrapped estimates of uncertainty for the different metrics? If so, how do we want to compute them for the different curves?
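For reference, a minimal sketch of what a percentile bootstrap for a scalar metric could look like, assuming we have fixed arrays of labels and predicted probabilities. The function and variable names here are illustrative only and are not part of the meds-evaluation API; AUROC via scikit-learn is just one example metric.

```python
# Minimal sketch of a percentile bootstrap for a scalar metric (AUROC here),
# assuming a fixed array of labels and predicted probabilities.
# Names are illustrative, not part of meds-evaluation.
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_metric(labels, probs, metric=roc_auc_score, n_boot=1000, seed=0):
    """Return the point estimate and a 95% percentile interval for `metric`."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample the test set with replacement
        if len(np.unique(labels[idx])) < 2:
            continue  # skip degenerate resamples with only one class present
        estimates.append(metric(labels[idx], probs[idx]))
    point = metric(labels, probs)
    lo, hi = np.percentile(estimates, [2.5, 97.5])
    return point, (lo, hi)
```

For curve-based outputs (ROC, PR, calibration curves) the analogous question is whether to report pointwise bands by evaluating each bootstrap resample on a common grid of thresholds, which is what the question above is getting at.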

mmcdermott commented 2 weeks ago

I think we need to be very careful about using bootstrapping in this package to estimate uncertainty.

In this package, we only have access to a fixed set of predictions on a fixed test set.

We can resample from that test set with replacement to assess how test set selection affects the final model performance. However, because this is a generalized benchmarking setting rather than a true deployment of a fixed, frozen model, I think such estimates will significantly underestimate performance variance: they do not account for the variance introduced by sampling different training populations, or by different random initializations and other stochastic processes during model training.

For example, suppose our metric were accuracy. Since accuracy is just the probability that a random draw from the test set is predicted correctly, bootstrapping over a fixed set of test-set predictions does nothing for us beyond simulating the sampling distribution of a binomial proportion.
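A quick illustration of that point, under the assumption of a fixed vector of per-example correctness indicators (all names and numbers here are made up for the demo): the bootstrap standard error of accuracy essentially reproduces the analytic binomial standard error sqrt(p(1-p)/n).

```python
# Sketch: for fixed test-set predictions, the bootstrap SE of accuracy
# matches the analytic binomial SE, so resampling adds no new information.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
correct = rng.random(n) < 0.85  # hypothetical per-example correctness indicators

p_hat = correct.mean()
analytic_se = np.sqrt(p_hat * (1 - p_hat) / n)

boot_accs = [correct[rng.integers(0, n, size=n)].mean() for _ in range(5000)]
bootstrap_se = np.std(boot_accs)

print(f"analytic SE:  {analytic_se:.4f}")
print(f"bootstrap SE: {bootstrap_se:.4f}")  # the two agree closely
```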

In reality, since users of this benchmark will be imagining taking these results and applying them to their own local datasets, the variance they should expect is much higher, as it must account for resampling the training set, re-training the model, etc.

mmcdermott commented 2 weeks ago

Tagging @EthanSteinberg and @kamilest for discussion as well