scikit-learn-contrib / MAPIE

A scikit-learn-compatible module to estimate prediction intervals and control risks based on conformal predictions.
https://mapie.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Unable to reach the target coverage for the Jackknife after bootstrap method even on the exoplanet notebook #367

Closed AlexisVignard-hub closed 11 months ago

AlexisVignard-hub commented 11 months ago

Hello

I've tried to use the jackknife-after-bootstrap method on different datasets, and I can't reach the target coverage (i.e., 0.95 for a miscoverage level of 0.05 specified with the alpha parameter). At first I thought there must be a problem with my data (a violation of the exchangeability assumption, for example), or that I'd made a mistake somewhere in my code.

I then tried to reproduce the exact same example presented in the MAPIE documentation for the exoplanet dataset by downloading the notebook and the data available here: https://github.com/scikit-learn-contrib/MAPIE/blob/master/notebooks/regression/exoplanets_mass.csv (same data, same model), and indeed the coverage is approximately 0.8 for most alpha thresholds, not 0.95 as indicated in the example.

I'd like to know if other people have encountered this problem, and whether it's due to an error on my part or to a problem in the implementation of the MAPIE method.

To reproduce this problem, use the jackknife-after-bootstrap method on any dataset and display the coverage of the intervals, as in the sketch below. You can also download the official MAPIE notebook https://github.com/scikit-learn-contrib/MAPIE/blob/master/notebooks/regression/exoplanets.ipynb, run it with the exoplanet data (linked above), and compare your jackknife-after-bootstrap coverage with the one shown in the notebook.
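For concreteness, here is a minimal sketch of the kind of check I am describing, assuming the MapieRegressor/Subsample API used in the notebook; the synthetic data and the RandomForestRegressor are illustrative stand-ins for the exoplanet setup:

```python
# Minimal sketch of the coverage check described above, assuming MAPIE's
# MapieRegressor / Subsample / regression_coverage_score API. Synthetic
# data and RandomForestRegressor stand in for the exoplanet setup.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

from mapie.metrics import regression_coverage_score
from mapie.regression import MapieRegressor
from mapie.subsample import Subsample

X, y = make_regression(n_samples=2700, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Jackknife-after-bootstrap, parameterized as in the exoplanet notebook.
mapie = MapieRegressor(
    RandomForestRegressor(random_state=42),
    method="plus",
    cv=Subsample(n_resamplings=5, random_state=42),
)
mapie.fit(X_train, y_train)

# alpha=0.05 targets 0.95 coverage; print the effective coverage to compare.
_, y_pis = mapie.predict(X_test, alpha=0.05)
coverage = regression_coverage_score(y_test, y_pis[:, 0, 0], y_pis[:, 1, 0])
print(f"Effective coverage: {coverage:.3f} (target 0.95)")
```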

Here is the official coverage plot, where you can see jackknife after bootstrap in purple (also available at the bottom of the official notebook), which reaches the target coverage:

[Image: official coverage plot from the notebook]

Here is what I obtain by downloading and running this exact same notebook:

[Image: coverage plot obtained by re-running the same notebook]

Best regards, Alexis

thibaultcordier commented 11 months ago

Hello @AlexisVignard-hub, thank you for reporting this bug. I will investigate the issue as soon as possible and get back to you to explain the reasons for your observations.

thibaultcordier commented 11 months ago

TL;DR: The jackknife-after-bootstrap method is based on the Subsample class which, with the default parameters, generates too little calibration data because many data points are common to all the sub-samples. A corrective solution to the problem:

"jackknife_plus_ab": Params(method="plus", cv=Subsample(n_resamplings=5, n_samples=int(len(X_train)/10)))

Bug explained

The jackknife-after-bootstrap method cannot achieve the target coverage in certain specific configurations. It is not clear at first sight whether the problem is related to the data, the method, or the parameters used, such as the random state or the default parameters.

In particular, the problem can be reproduced in the exoplanets notebook in the MAPIE documentation.

Investigation

I confirm that the jackknife-after-bootstrap method under-covers in the notebook above, so the problem is not due to the reporter.

Furthermore, the problem is not systematic. As an example, see the following tutorial in which the method reaches the target coverage without any problem.

As all the cross-conformal methods share the same structure in the library, the problem seems to be linked to the Subsample class, which is the main difference from the other standard methods.

To be more precise, here is the set of parameters used to instantiate the jackknife method after bootstrap:

"jackknife_plus_ab": Params(method="plus", cv=Subsample(n_resamplings=5)

The reason for the bug showed up in the logs after deleting the following line of code: warnings.filterwarnings("ignore").

mapie/utils.py:453: UserWarning: WARNING: at least one point of training set belongs to every resamplings.
Increase the number of resamplings

Here we understand that the method with the default parameters generates too little calibration data because many data points are common to all the resamplings (behind "at least one" hides a significant number of data points). To summarise: in a cross-conformal method, the data is sampled several times into sub-samples used for training, and each data point is then reused as calibration data for the trained models that did not see it during training. If a data point appears in every sub-sample, it can never be used as calibration data.
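To gauge how many points are affected, here is a quick simulation (plain numpy, independent of MAPIE, with sizes chosen only for illustration) estimating the fraction of training points that land in every one of the five bootstrap resamplings; theory predicts about (1 - (1 - 1/n)^n)^5 ≈ 0.632^5 ≈ 10%:

```python
# Quick numpy simulation, independent of MAPIE: estimate the fraction of
# training points that appear in every one of k bootstrap resamplings of
# size n and can therefore never serve as calibration data.
import numpy as np

rng = np.random.default_rng(0)
n, k = 2000, 5  # illustrative training size and number of resamplings

in_every_resampling = np.ones(n, dtype=bool)
for _ in range(k):
    drawn = np.zeros(n, dtype=bool)
    drawn[rng.integers(0, n, size=n)] = True  # bootstrap: n draws with replacement
    in_every_resampling &= drawn

# Theory: (1 - (1 - 1/n)**n)**k ≈ 0.632**5 ≈ 0.10
print(f"Fraction never usable for calibration: {in_every_resampling.mean():.3f}")
```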

In our case, a significant number of data points cannot be used as calibration data, which deteriorates the quality of the estimate of the distribution of non-conformity scores. This side effect explains why the method's effective coverage falls short of the target coverage.

Remedy

The proposed solution to avoid this side effect is to reduce the number of data points in each sub-sample so that all data points can be used as calibration data. This can be done using the following set of parameters:

"jackknife_plus_ab": Params(method="plus", cv=Subsample(n_resamplings=5, n_samples=int(len(X_train)/10)))

Here are the final results:

[Screenshot, 2023-10-23: coverage plot after the fix]

Conclusion and outlook

To conclude, this bug stems from a side effect of the Subsample class, which manages the jackknife-after-bootstrap sampling. The default parameters do not provide sufficient calibration data and therefore produce under-covered intervals. A simple remedy is to manually choose a more suitable number of samples per sub-sample, as proposed above.

To avoid confusion in the future, it would be appropriate to adapt the exoplanet notebook with a better instantiation of Subsample.

AlexisVignard-hub commented 11 months ago

Hello @thibaultcordier, thank you very much for your very detailed answer.

The proposed solution does indeed work, and I would like to know if any research has been carried out on the subject of the size of these sub-samples, as well as their number.

The exoplanet dataset contains approximately 2,700 rows, and here dividing the training set by 10 seems efficient. But what about much larger and much smaller datasets?

From the few tests I've been able to carry out, dividing the training set by 10 seems to be an acceptable rule to reach coverage, but for smaller datasets the intervals tend to be too conservative. So I'd like to know whether your choice of dividing the training set by 10 has a theoretical basis, or whether it is more of a choice suited to the exoplanet dataset.

Best regards, Alexis

thibaultcordier commented 11 months ago

The jackknife-after-bootstrap method was proposed in the following paper: Kim, B., Xu, C., & Barber, R. (2020). Predictive inference is free with the jackknife+-after-bootstrap. Advances in Neural Information Processing Systems, 33, 4138-4149. We drew inspiration from this article to implement it in MAPIE.

Unfortunately, I don't have any references to share with you (or at least I'm not aware of any) regarding the research that has been carried out into the size and number of these sub-samples.

However, the intuition is as follows: the larger each sub-sample, the better each resampled model is trained, but the more likely a given data point is to end up in every sub-sample and thus never be usable for calibration; the smaller each sub-sample, the weaker the trained models, but the more calibration data is left over.

So it's a trade-off between the performance of your predictive model and the quality of your model calibration.

My choice to divide the training set by 10 comes from an intuitive/practical basis, and is therefore more a choice suited to the exoplanet dataset (here, I chose n_resamplings=5 and n_samples=n_data/(2*n_resamplings) to increase the amount of calibration data not used for training).
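As a sketch, that rule of thumb could be wrapped as follows (the helper name is mine, not part of MAPIE):

```python
# Hypothetical helper (not part of MAPIE) encoding the rule of thumb above:
# n_samples = n_data / (2 * n_resamplings), so that a comfortable share of
# points stays out of every sub-sample and remains usable for calibration.
from mapie.subsample import Subsample

def subsample_heuristic(n_data: int, n_resamplings: int = 5) -> Subsample:
    """Build a Subsample whose sub-sample size follows the n/(2k) heuristic."""
    return Subsample(
        n_resamplings=n_resamplings,
        n_samples=max(1, n_data // (2 * n_resamplings)),
    )

cv = subsample_heuristic(n_data=2700)  # exoplanet-sized set -> sub-samples of 270
```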

I hope I've been able to give you a satisfactory answer to your question.

Best regards, Thibault

AlexisVignard-hub commented 11 months ago

Thank you for your answer! This is very clear and helpful.

Best regards, Alexis