scikit-learn-contrib / MAPIE

A scikit-learn-compatible module to estimate prediction intervals and control risks based on conformal predictions.
https://mapie.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

MapieRegressor with prefit optimized model that used training and calibration data #477

Closed · ramy90 closed this issue 1 week ago

ramy90 commented 2 months ago

Hi, I would like to use a pretrained xgboost regression model as a "prefit" estimator in MapieRegressor. However, a K-fold cross-validation strategy was already used while training and tuning the hyperparameters of the xgboost model. This means there is no calibration dataset that the model has not seen before. Will there be data leakage if I use all the training data (the data used to train and optimize the xgboost model) as X_train in MapieRegressor?

In the regression example provided at https://github.com/scikit-learn-contrib/MAPIE/blob/044ae6977a7ed874686b78e278f0e9b433cb2f65/examples/regression/4-tutorials/plot_cqr_tutorial.py#L277, only the training data (X_train) was used, and not the calibration data (X_calib), to fit the MapieRegressor. Why was the calibration data excluded for MapieRegressor in this example?

Also, if I decide to fit the model anyway using the following code:

mapie_xgb = MapieRegressor(xgb_model, cv='prefit')
mapie_xgb.fit(X=X_train_transformed, y=y_train_transformed)

I get the following error:

ValueError: The two functions get_conformity_scores and get_estimation_distribution of the ConformityScore class are not consistent. The following equation must be verified: self.get_estimation_distribution(X, y_pred, self.get_conformity_scores(X, y, y_pred)) == y. The maximum conformity score is 9.5367431640625e-07. The eps attribute may need to be increased if you are sure that the two methods are consistent.

LacombeLouis commented 2 months ago

Hey @ramy90,

Split train and calib data

Indeed, as explained in the documentation (and as mentioned in: Barber, Rina Foygel, et al. "Predictive inference with the jackknife+." (2021): 486-507.), if you reuse values from your X_train in your calibration set, you will end up with prediction intervals that likely overfit and thereby get lower coverage than expected. I would suggest keeping a calibration dataset separate from your training data.

CQR tutorial

In regard to the issue in the cqr_tutorial, there is indeed a small mistake. We should use the full dataset; I will provide a fix by simply not splitting into X_train and X_calib, as the classes take care of that.

Consistency issue ConformityScore

With regard to your last issue, could you provide more information / share your code? Here is an example of a working version with an xgb_model:

from mapie.regression import MapieRegressor
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Generate a toy regression dataset
X, y = make_regression(
    n_samples=500, n_features=10, noise=1.0, random_state=1
)

# Split into training, calibration and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
X_train, X_calib, y_train, y_calib = train_test_split(X_train, y_train, test_size=0.5, random_state=42)

# Fit the base model on the training set only
xgb_model = XGBRegressor()
xgb_model.fit(X_train, y_train)

# Calibrate MAPIE on the held-out calibration set
mapie_xgb = MapieRegressor(xgb_model, cv='prefit')
mapie_xgb.fit(X=X_calib, y=y_calib)

# Returns point predictions and the (1 - alpha) prediction intervals
mapie_xgb.predict(X_test, alpha=0.1)

Note that you can also fix the consistency issue by following this prior issue: #321.

ramy90 commented 2 months ago

Hello @LacombeLouis

Thank you so much for your explanation, it helped a lot! Following the solution in #321 solved the consistency issue.

I would suggest trying to keep a separate dataset from your training

If I understood you correctly, the modeling can be broken into the following steps (see the sketch after this list):

  1. Split the dataset into training, calibration and testing
  2. Define and fit the regression model (In my case: XGBoost regressor) using the training data
  3. Optimize the model's hyperparameters using the training data only (In my case: using K-fold cross-validation)
  4. Define MapieRegressor with the pre-fitted regression model while choosing the cv strategy to be prefit
  5. Fit the MapieRegressor using the calibration data, as you kindly mentioned in the example above.
  6. Predict and test the results of MapieRegressor using the testing data
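
A minimal sketch of these steps, assuming an XGBoost base model; the dataset, split sizes and hyperparameter grid are only illustrative placeholders, not a prescribed setup:

from mapie.regression import MapieRegressor
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Toy data standing in for the real dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=1.0, random_state=1)

# 1. Split into training, calibration and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_calib, y_train, y_calib = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# 2. + 3. Fit and tune the base model with K-fold CV on the training data only
grid = GridSearchCV(
    XGBRegressor(),
    param_grid={"max_depth": [3, 5], "n_estimators": [100, 300]},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
grid.fit(X_train, y_train)

# 4. + 5. Wrap the tuned model with cv="prefit" and calibrate on the calibration set
mapie = MapieRegressor(grid.best_estimator_, cv="prefit")
mapie.fit(X_calib, y_calib)

# 6. Evaluate point predictions and prediction intervals on the test set
y_pred, y_pis = mapie.predict(X_test, alpha=0.1)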

What still confuses me a bit is the following: let's say that, after evaluating the coverage and interval width, I am happy with the results. Now I would like to finalize the model for deployment and take advantage of the whole dataset that I have. Can I use the training and testing data to train the base model (XGBoost) and the calibration data to fit the MapieRegressor? What is the best way to utilize the whole dataset?

LacombeLouis commented 2 months ago

Hey @ramy90,

I believe these modeling steps explain quite well what needs to be done. Indeed, in a "production" environment, you would probably want to use the full potential of your data. This means that you would want to use most, if not all, of your training data. To achieve this, you would likely want to fit your estimator in MAPIE using the CV+ method, which uses cross-validation so that every sample serves for both training and calibration, letting you utilize your entire dataset.
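
For reference, a minimal sketch of the CV+ approach (the dataset and split below are illustrative); note that the estimator is passed un-fitted and MAPIE handles the K-fold fitting internally:

from mapie.regression import MapieRegressor
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=1.0, random_state=1)

# Only a test set is held out; no separate calibration split is needed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# CV+ : the estimator is re-fitted on each of the 5 folds and the
# conformity scores are computed on the corresponding out-of-fold samples
mapie_cv = MapieRegressor(XGBRegressor(), method="plus", cv=5)
mapie_cv.fit(X_train, y_train)

y_pred, y_pis = mapie_cv.predict(X_test, alpha=0.1)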

If your question is rather about merging the train and test datasets, then that decision is unfortunately outside the scope of MAPIE. In machine learning you are not supposed to touch the test dataset, so your testing set would still need to be kept separate.

Note: ensure that your data is homoscedastic (one of the key assumptions for conformal predictions).

I hope this answers your question.

deepakorani commented 1 month ago

@LacombeLouis I had a similar issue when I was setting up MAPIE; I am not sure of the best way to use these methods.

@ramy90's comments on how to use [MAPIE](https://github.com/scikit-learn-contrib/MAPIE/issues/477#issuecomment-2211811563) have helped, but I am still unclear on how to set this up in production. Also, if we do split into train, test and calibration sets, we have much less training data to fit the model on, which is especially tough in drug discovery tasks.

My workflow is similar to @ramy90's. I was curious whether we can remove the K-fold cross-validation and instead fit MAPIE on our training data with the CV+ approach, but then I was not sure whether we would be leaking information into our test set when we predict the test prediction intervals.

More examples could help, especially in the binary classification regime for set prediction; it would also help if you could add imbalanced-data use cases, as real-world data is in most cases never well distributed.