Closed ramy90 closed 1 week ago
Hey @ramy90,
Indeed, as explained in the documentation (and in Barber, Rina Foygel, et al. "Predictive inference with the jackknife+." The Annals of Statistics 49.1 (2021): 486-507), if you reuse values from your X_train as your calibration set, you will end up with prediction intervals that likely overfit and thereby get a lower coverage than expected. I would suggest keeping a calibration dataset separate from your training data.
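A small self-contained sketch of why this happens, using a hand-rolled split-conformal interval rather than MAPIE itself (GradientBoostingRegressor, the interval_coverage helper, and the dataset sizes are all illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
X_fit, X_calib, y_fit, y_calib = train_test_split(X_train, y_train, test_size=0.5, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_fit, y_fit)

def interval_coverage(X_src, y_src, alpha=0.1):
    # Split-conformal half-width: (1 - alpha) quantile of absolute residuals
    # computed on (X_src, y_src), then test coverage of y_pred +/- that width.
    scores = np.abs(y_src - model.predict(X_src))
    q = np.quantile(scores, 1 - alpha)
    y_pred = model.predict(X_test)
    return float(np.mean((y_test >= y_pred - q) & (y_test <= y_pred + q)))

# Calibrating on the data the model was fit on underestimates the residuals,
# so test coverage tends to fall below the 90% target ...
cov_train = interval_coverage(X_fit, y_fit)
# ... while a held-out calibration set gives (approximately) valid coverage.
cov_holdout = interval_coverage(X_calib, y_calib)
print(cov_train, cov_holdout)
```

The held-out variant is exactly the split that the example below performs before handing the model to MAPIE.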
In regard to the issue in the cqr_tutorial, there is indeed a small mistake: we should use the full dataset. I will provide a fix by simply not splitting into an X_train and an X_calib, as the classes take care of that.
With regard to your last issue (the ConformityScore error), could you provide more information / share your code? I have provided an example of a working version with an xgb_model:
from mapie.regression import MapieRegressor
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Generate a toy regression dataset
X, y = make_regression(
    n_samples=500, n_features=10, noise=1.0, random_state=1
)

# Hold out a test set, then split the remainder into training and calibration
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
X_train, X_calib, y_train, y_calib = train_test_split(X_train, y_train, test_size=0.5, random_state=42)

# Fit the base model on the training data only
xgb_model = XGBRegressor()
xgb_model.fit(X_train, y_train)

# Wrap the pre-fitted model and calibrate on the held-out calibration set
mapie_xgb = MapieRegressor(xgb_model, cv='prefit')
mapie_xgb.fit(X=X_calib, y=y_calib)
mapie_xgb.predict(X_test, alpha=0.1)
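After predict, it can be useful to check how often the intervals actually contain the test targets. A minimal sketch of such a check, assuming MAPIE's (n_samples, 2, n_alpha) interval shape; the empirical_coverage helper is hypothetical, not part of MAPIE:

```python
import numpy as np

def empirical_coverage(y_true, y_pis):
    # Fraction of true values falling inside [lower, upper] bounds,
    # where y_pis has MAPIE's predict() shape (n_samples, 2, n_alpha).
    lower = y_pis[:, 0, 0]
    upper = y_pis[:, 1, 0]
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

# Toy check with hand-made intervals:
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pis = np.stack(
    [np.array([[-0.5], [0.5]]),   # covers 0.0
     np.array([[0.5], [1.5]]),    # covers 1.0
     np.array([[2.5], [3.5]]),    # misses 2.0
     np.array([[2.5], [3.5]])],   # covers 3.0
)
print(empirical_coverage(y_true, y_pis))  # → 0.75
```

With alpha=0.1 you would expect the empirical coverage on real test data to sit near 0.90 when calibration was done on held-out data.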
Note that you can also fix the consistency issue by following this prior issue: #321.
Hello @LacombeLouis
Thank you so much for your explanation, it helped a lot! Following the solution in #321 solved the consistency issue.
"I would suggest trying to keep a separate dataset from your training"
If I understood you correctly, the modeling can be broken into the following steps:
1. Initialize the MapieRegressor with the pre-fitted regression model while choosing the cv strategy to be prefit.
2. Fit the MapieRegressor using the calibration data, as you kindly mentioned in the example above.
3. Predict with the MapieRegressor using the testing data.
What still confuses me a bit is the following:
Let's say that after evaluating the coverage and interval width, I am happy with the results. Now I would like to finalize the model to deploy it, and I want to take advantage of the whole dataset that I have. Can I use the training and testing data to train the base model (XGBoost) and the calibration data to fit the MapieRegressor? What is the best way to utilize the whole dataset?
Hey @ramy90,
I believe these modeling steps explain quite well what needs to be done. Indeed, in a “production” environment, you would probably want to use the full potential of your data. This means that you would want to use most, if not all, of your training data. To achieve this, you would likely need to fit your estimator in MAPIE using a CV+ method, which will treat the current data you have as separate sets for training and calibration, but use cross-validation in order to utilize your entire dataset.
If your question is rather about merging the train and test datasets, then that decision is unfortunately outside the scope of MAPIE. In machine learning, you're not supposed to touch the test dataset; your testing set would still need to be kept separate.
Note: ensure that your data is homoscedastic (one of the key assumptions for conformal prediction).
I hope this answers your question.
@LacombeLouis I had a similar issue when I was setting up MAPIE; I am not sure of the best way to use these methods.
@ramy90's comments on how to use MAPIE (https://github.com/scikit-learn-contrib/MAPIE/issues/477#issuecomment-2211811563) have helped, but I am still unclear on how to set this up in production. Also, if we split into train, test, and calibration sets, we have far less training data to fit the model on, which is especially tough in drug discovery tasks.
My workflow is similar to @ramy90's. I was curious whether we can skip the separate calibration split and instead fit MAPIE on our training data with the CV+ approach, but then I was not sure whether we would be leaking information into our test set when we predict the test prediction intervals.
More examples could help, especially for set prediction in the binary regime; adding imbalanced-data use cases would be valuable, as real-world data is rarely well distributed.
Hi, I would like to use a pretrained XGBoost regression model as a "prefit" estimator in MapieRegressor. However, a K-fold cross-validation strategy was already used during training and hyperparameter tuning of the XGBoost model. This means there is no calibration dataset that the model hasn't seen before. Will there be data leakage if I use all the training data (which was used to train and optimize the XGBoost model) as X_train in MapieRegressor?
In the regression example provided (https://github.com/scikit-learn-contrib/MAPIE/blob/044ae6977a7ed874686b78e278f0e9b433cb2f65/examples/regression/4-tutorials/plot_cqr_tutorial.py#L277), only the training data (X_train) was used to fit the MapieRegressor, and no calibration data (X_calib). Why was the calibration data excluded for the MapieRegressor in this example?
Also, if I decide to fit the model anyways using the following code:
mapie_xgb = MapieRegressor(xgb_model, cv='prefit')
mapie_xgb.fit(X=X_train_transformed, y=y_train_transformed)
I get the following error:
ValueError: The two functions get_conformity_scores and get_estimation_distribution of the ConformityScore class are not consistent. The following equation must be verified: self.get_estimation_distribution(X, y_pred, self.get_conformity_scores(X, y, y_pred)) == y
The maximum conformity score is 9.5367431640625e-07.
The eps attribute may need to be increased if you are sure that the two methods are consistent.