scikit-learn-contrib / MAPIE

A scikit-learn-compatible module to estimate prediction intervals and control risks based on conformal predictions.
https://mapie.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
1.2k stars 99 forks

Support CatBoost Models with RMSEWithUncertainty loss #423

Closed inti4digbi closed 3 months ago

inti4digbi commented 4 months ago

Describe the bug

CatBoost offers the loss_function RMSEWithUncertainty, which estimates a variance for each prediction during training. I have found it to perform better than RMSE in some applications. However, I do not want to use the uncertainty estimates it provides.

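When using such a model with MAPIE:

from mapie.regression import MapieRegressor

# prefitted_model is a CatBoost regressor already trained with RMSEWithUncertainty
mapie_reg = MapieRegressor(prefitted_model, cv="prefit")
mapie_reg.fit(X_train, y_train)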

I get the error message pasted below. I checked the code and found the cause: when a model is trained with the RMSEWithUncertainty loss, its predict method returns an array of shape (n_samples, 2), with the columns holding the predicted mean and the predicted variance, whereas MAPIE expects shape (n_samples,).

I can imagine a simple solution: if the output has more than one column, the first one is used as the predicted mean. However, I can see how this could break when a predict method arranges its columns differently, for whatever reason its developers chose.
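As a rough illustration (not MAPIE code), the shape mismatch and the first-column idea can be reproduced with NumPy alone. The arrays below are placeholders, and the column order (mean first, variance second) is an assumption taken from the description above:

```python
import numpy as np

y = np.zeros(515)            # observed targets, shape (n_samples,)
y_pred = np.zeros((515, 2))  # assumed layout: column 0 = mean, column 1 = variance

# Subtracting a (515, 2) array from a (515,) array cannot be broadcast.
try:
    np.subtract(y, y_pred)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Keeping only the (assumed) mean column restores matching shapes.
residuals = np.subtract(y, y_pred[:, 0])
print(residuals.shape)  # (515,)
```

The column index is exactly the fragile part: if another model's predict method put the variance first, the same slice would silently select the wrong quantity.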

It would be great if you could support this.

Many thanks in advance

To Reproduce

  1 print(X_train.shape,y_train.shape)                                                           │
│   2                                                                                              │
│ ❱ 3 mapie_reg.fit(X_train, y_train,alpha=ALPHA)                                                  │
│   4                                                                                              │
│                                                                                                  │
│ /Users/intipedroso/micromamba/envs/datapipeline/lib/python3.10/site-packages/mapie/regression/re │
│ gression.py:541 in fit                                                                           │
│                                                                                                  │
│   538 │   │                                                                                      │
│   539 │   │   # Compute the conformity scores (manage jk-ab case)                                │
│   540 │   │   self.conformity_scores_ = \                                                        │
│ ❱ 541 │   │   │   self.conformity_score_function_.get_conformity_scores(                         │
│   542 │   │   │   │   X, y, y_pred                                                               │
│   543 │   │   │   )                                                                              │
│   544                                                                                            │
│                                                                                                  │
│ /Users/intipedroso/micromamba/envs/datapipeline/lib/python3.10/site-packages/mapie/conformity_sc │
│ ores/conformity_scores.py:205 in get_conformity_scores                                           │
│                                                                                                  │
│   202 │   │   NDArray of shape (n_samples,)                                                      │
│   203 │   │   │   Conformity scores.                                                             │
│   204 │   │   """                                                                                │
│ ❱ 205 │   │   conformity_scores = self.get_signed_conformity_scores(X, y, y_pred)                │
│   206 │   │   if self.consistency_check:                                                         │
│   207 │   │   │   self.check_consistency(X, y, y_pred, conformity_scores)                        │
│   208 │   │   if self.sym:                                                                       │
│                                                                                                  │
│ /Users/intipedroso/micromamba/envs/datapipeline/lib/python3.10/site-packages/mapie/conformity_sc │
│ ores/residual_conformity_scores.py:45 in get_signed_conformity_scores                            │
│                                                                                                  │
│    42 │   │   and the observed ones, from the following formula:                                 │
│    43 │   │   signed conformity score = y - y_pred                                               │
│    44 │   │   """                                                                                │
│ ❱  45 │   │   return np.subtract(y, y_pred)                                                      │
│    46 │                                                                                          │
│    47 │   def get_estimation_distribution(                                                       │
│    48 │   │   self,                                                                              │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: operands could not be broadcast together with shapes (515,) (515,2) 

LacombeLouis commented 3 months ago

Hey @inti4digbi,

Thank you for raising this issue. Indeed, I don't think this is something that can be fixed within our framework. I believe you could write a wrapper class that overrides the predict method to return only the prediction (check out #340).
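A minimal sketch of such a wrapper might look like the following. The names `MeanOnlyRegressor`, `mean_column` and `_FakeUncertaintyModel` are hypothetical and appear in neither MAPIE nor CatBoost; the fake model only mimics a predict method that returns (mean, variance) columns. The sketch assumes the mean sits in column 0. To keep it dependency-free it does not inherit from scikit-learn's `BaseEstimator`, which may well be needed for it to pass MAPIE's estimator checks:

```python
import numpy as np


class MeanOnlyRegressor:
    """Hypothetical wrapper exposing only the predicted mean of a fitted model."""

    def __init__(self, model, mean_column=0):
        self.model = model              # an already-fitted model
        self.mean_column = mean_column  # assumed position of the mean column

    def fit(self, X, y=None):
        # The wrapped model is prefit, so fitting is a no-op.
        return self

    def __sklearn_is_fitted__(self):
        # Hook consulted by scikit-learn's check_is_fitted.
        return True

    def predict(self, X):
        y_pred = np.asarray(self.model.predict(X))
        if y_pred.ndim == 2:
            # Drop everything except the assumed mean column.
            y_pred = y_pred[:, self.mean_column]
        return y_pred


class _FakeUncertaintyModel:
    """Stand-in for a model whose predict returns (mean, variance) columns."""

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        mean = X.sum(axis=1)
        variance = np.ones(len(X))
        return np.column_stack([mean, variance])


X_demo = np.arange(12, dtype=float).reshape(6, 2)
wrapped = MeanOnlyRegressor(_FakeUncertaintyModel())
y_hat = wrapped.predict(X_demo)
print(y_hat.shape)  # (6,)
```

With a real prefitted CatBoost model in place of the stand-in, the wrapped object would be what gets passed to `MapieRegressor(..., cv="prefit")`. This has not been verified against any particular MAPIE version.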

Also, I am not an expert on CatBoost or on using the loss_function RMSEWithUncertainty, but if you only plan to use the prediction and not the uncertainty estimates, is that not the same as simply using the loss_function RMSE? Do not hesitate to ask if you have any further questions.

Thank you!

inti4digbi commented 3 months ago

Thank you